This program emphasizes practical hardware objectives in LSI form for
implementing digital filters as a general class of applications and the FFT
algorithm as a specific application. Three LSI chip designs were completed:
the SPAU (signal processing arithmetic unit), the SPDL (signal processing
delay line), and the SPAC (signal processing address control). The most
comprehensive design, the SPAU, went through two design iterations, and the
final version has achieved a high degree of acceptance for general filter
processing applications. Twenty-five chips each of the SPAU and SPAC designs
were delivered. Fifty units of the SPDL were delivered, as well as hardware
samples of work in progress from time to time. Insofar as practical,
universal designs were sought and are believed to have been achieved,
particularly in the SPAU. Based on comparative studies of these
implementations and SSI/MSI/LSI alternatives, a large saving in board space,
power, and interconnections results from using the generic LSI chips
developed by this program.

SECTION I

TECHNICAL  DISCUSSION 


1.1  Purpose 


The purpose of the LSI implementation program was to design and fabricate
LSI chip types that are practical for reducing the cost of implementing
digital filter applications in DOD equipment. A specific objective is to
implement Fourier transform methods.

1.2 Development Progress Brief

This work started out by examining DOD equipment for digital filter
applications and centered on digital Fourier transform methods. It was found
that Fourier transform methods are widely practiced in new designs and are
considered a growth application, given the general technology trend away
from analog and toward digital equipment.

Various exploratory architectures were reviewed for the LSI implementation.
Of the three general architectures for FFT systems, i.e., serial,
parallel-iterative, and array, the parallel-iterative was the only one used
in the applications examined. We therefore accepted this as the starting
point for this investigation.

Serial methods might be attractive for LSI because of the reduced I/O lead
requirements, but on the other hand very high chip-to-chip data rates would
then have to be used. The new LSI should stay within the data rate
capability of existing MSI, since no practical system can be built entirely
of new LSI; it will continue to rely on existing supporting hardware. This
fact rules out the high speed serial methods that would have to be used to
supplant parallel processing as now used. The large array processor is at
the other extreme of required I/O and represents overkill for the majority
of applications. The parallel-iterative approach is the obvious choice for
implementing essentially all applications, from parallel-iterative to full
parallel array, except the strictly serial processor. Even with the latter,
the introduction of serial-to-parallel I/O transformation can still retain
most of the LSI advantages.


The principal arithmetic unit LSI went through two major iterations, and
the final result achieved a high degree of both simplicity and adaptability
to general digital filter problems. This is the SPAU 2, which has a parallel
multiplier-accumulator organization. Two other chips were also implemented
in LSI for addressing the input data for FFT solution. These were not
general purpose but special purpose LSI, which were found to carry
considerable weight in reducing the cost of an FFT system through their MSI
replacement factor. These chips were the SPDL, for Signal Processing Delay
Line, and the SPAC, for Signal Processing Address Control. Two designs, the
SPAC 1 and SPAC 2, were made for the latter function.


SECTION  II 
DIGITAL  FILTERS 


"A digital filter, in broad terms, is any device which accepts a sequence
of numbers as its input and operates on them to produce another number
sequence as its output." (Ref. 1) According to this definition, a digital
computer or a calculator are the best known examples. However elegant these
machines may be, they are rarely cost-effective or capable of real time
digital filtering in such applications as communications, radar, and sonar.
Such jobs are typically relegated to analog filters or special purpose
custom built digital filters. Analog filters are capable of the highest
speed performance, but may lack accuracy, versatility, and freedom from
drift. It is on these last three factors that any decision to build a
digital filter rests.

The fundamental arithmetical operations of digital filters are addition,
delay, and shifting. Multiplication, a product of addition, delay, and
shifting, is further adopted as a convenience. Thus, to build an arithmetic
unit, only these four operations are considered necessary to implement in
this work. Another consideration is to perform each operation in parallel on
whole digital words, which increases the operational speed of the machine.
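
The remark that multiplication is "a product of addition, delay, and
shifting" can be illustrated with a short sketch. The function name and the
sign handling below are illustrative assumptions and do not describe the
SPAU multiplier circuit, which is a parallel design; the sketch only shows
that a product can be built from shifts and additions spread over
successive steps.

```python
def shift_add_multiply(a: int, b: int, bits: int = 12) -> int:
    """Multiply two signed integers using only shifts and additions.

    Hypothetical helper for illustration; `bits` is the number of
    multiplier bits examined, one per step.
    """
    sign = -1 if (a < 0) != (b < 0) else 1
    multiplicand, multiplier = abs(a), abs(b)
    product = 0
    for step in range(bits):              # one step (delay) per multiplier bit
        if (multiplier >> step) & 1:      # test bit `step` of the multiplier
            product += multiplicand << step   # add the shifted multiplicand
    return sign * product


result = shift_add_multiply(37, -21)
```

A parallel multiplier performs all of these shifted additions at once
instead of over successive steps, which is the speed consideration the
paragraph above refers to.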

The length of the digital word used in digital filters depends on the
accuracy requirement and the specific kind of filter used in an application.
In general, recursive filters require greater word length than non-recursive
filters for a given accuracy, because of the accumulation of quantization or
round-off errors. The first general purpose Signal Processing Arithmetic
Unit design, SPAU 1, uses a 12-bit word length, sign plus 11 bits. This was
a compromise, since it is difficult to retain sufficient accuracy using a
12-bit word length and single precision computation. The SPAU 1 has
capability for double precision computation, but at the expense of longer
programming steps. A second design, SPAU 2, also uses a 12-bit word length,
but all computation on the chip is carried out in double precision, with
much better capability for retaining adequate accuracy for digital filter
problems. The trade-offs between SPAU 1 and SPAU 2 are discussed in detail
in the text.
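
The accuracy argument can be made concrete with a small fixed-point sketch.
It assumes a fractional format of sign plus 11 bits and simple truncation;
the function names, the truncation rule, and the test data are illustrative
assumptions, not the documented SPAU arithmetic.

```python
FRACTION_BITS = 11  # sign plus 11 bits, as in the 12-bit word


def mac_single_precision(xs, cs):
    """Truncate every product back to the 12-bit format before adding."""
    acc = 0
    for x, c in zip(xs, cs):
        acc += (x * c) >> FRACTION_BITS
    return acc


def mac_double_precision(xs, cs):
    """Accumulate full-width products and truncate once at the end."""
    acc = 0
    for x, c in zip(xs, cs):
        acc += x * c
    return acc >> FRACTION_BITS


samples = [1024] * 8       # 0.5 in this fractional format
coefficients = [1] * 8     # one least significant bit, 1/2048
single = mac_single_precision(samples, coefficients)
double = mac_double_precision(samples, coefficients)
```

For this deliberately unfavorable data every truncated product is zero, so
the single precision sum loses the result entirely, while the double
precision sum keeps the exact value of four least significant bits.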

During the period of this contract, NM to 10//, the general yield
capability of the IM technology rose from about MHW devices to over 40,000
devices. The initial design goals in terms of device complexity were set at
approximately lb,000 devices. SPAU 1 and SPAU 2 were respectively lb,000
device and hi,000 device LSI chips. However, the device count was never a
limiting factor, and much greater complexity could have been used except for
practical factors such as package pin limitations. The package pin limit was
set at 64 leads, and standard flat packs and DIPs did not increase over this
limit throughout this period. Although timesharing of input/output pins can
help, and was extensively used in SPAU 1, a penalty is paid due to
interference in loading and retrieving information, and efficient
programming was more difficult. The SPAU 2, on the other hand, has four
ports, two input and two output, and the ports are not time-shared. This
turned out to be simpler and more generally efficient.

2.1 DIGITAL FILTER STUDY

The end purpose of this work was to implement a wide class of digital
filters in LSI and determine the design requirements of such LSI. A joint
study was conducted by members of the Systems Group along with
Microelectronics personnel. Digital filter systems were taken under review
that had previously received design attention using SSI/MSI/LSI devices.
These were taken as object lessons for circuit partitioning studies, and
much was learned with respect to the circuit architecture for a universal
arithmetic chip.

The course chosen was further defined as being one where the LSI would be
used if it made common sense to do so and not otherwise. It is not expected
that cost effective LSI will be used entirely throughout a system, although
there may not be technical objections to doing so. For example, RAM
memories, ROM's and MSI buffer electronics will continue to be used. To
exclude these fine tuned products for a ritual LSI would overburden the LSI
design when it is unnecessary to do so, and detract from the LSI where it
should be a worthy replacement.

When is an LSI a worthy replacement? A contemporary digital filter uses
MSI/SSI off-the-shelf parts. These are typically general purpose and have
considerable flexibility with respect to interconnection, through the ample
number of input-output leads available. The LSI, on the other hand, will be
lead limited and will have to timeshare leads. If it is to be a worthy
replacement, the LSI cannot be a one-for-one replacement of disjoint MSI
arithmetic. This places an additional burden upon any study, for it most
assuredly means that any system will also have to be reconfigured for the
LSI before it can be judged. Such was the method used. Quite naturally, a
certain a priori knowledge of what the LSI was to include had to be assumed.
The system study, then, was a progressive learning experience from one
system to the next. The LSI circuit organization was realigned for more
effective and universal adaptation as the study progressed.

The first study was on a system called the "K-Band Report Back
Demodulator". The FFT kernel processor portion of this item was configured
for LSI and compared with the MSI/SSI version in terms of external
interconnect pins, circuit board area, maximum power, and total parts. The
LSI design was based on equivalent performance factors with the actual
MSI/SSI equipment. The results are shown in Table 1. The savings are
approximately 60% in all four categories in favor of the LSI.

A similar study was done using the GPS X User Set (Global Positioning
System) as a model. Two electronic processing blocks, the GPS preprocessor
block and the GPS demodulator block, were redesigned for LSI with
approximately the same savings as for the K-Band Report Back Demodulator
FFT, 60%.

Another way to look at the LSI and compare it with MSI/SSI is illustrated
in Figure 1. Here the block diagram of SPAU 1 is reproduced, and an estimate
is made of the number of MSI/SSI IC's required to duplicate its function.

TABLE 1. K-BAND REPORT BACK DEMODULATOR FFT BUTTERFLY
ARITHMETIC UNIT COMPARISONS

                              MSI/SSI       LSI/MSI       LSI
                              K-Band        K-Band        Reduction
                              Butterfly     Butterfly     Percentages

Total Maximum External        1434 pins     628 pins      56%
Interconnect Pins

Nominal Circuit Board         59 in2        35 in2        44%
Mounting Area

Total Maximum Power           50 watts      20 watts      60%

Total Parts                   76 DIPs       26 DIPs       66%

As shown in the figure, one SPAU 1 has the equivalent electronic function
of 55 TTL IC's. Based on the military grade cost of TTL in the 100 quantity
price bracket, the overall equivalent worth of the SPAU 1 is shown to be
$747.

The approximate selling price of the LSI tested to military specifications
is $207. Therefore, on the basis of packaged chip cost alone, the LSI can be
several times more cost effective than the MSI/SSI. If one were to take into
account supporting hardware, stocking, incoming inspection, maintenance,
spares, additional power supply support, and the like, then the cost
effectiveness of the LSI can be even greater.

Figure 1. The MSI/SSI Functional Equivalence of the SPAU 1


2.2  Discrete  Fourier  Transform,  DFT 


The Fourier series and the Fourier integral have had a powerful influence
on physical analysis. The discrete Fourier transform, DFT, has not been used
extensively in the past except as a special analytical tool, because it
requires a rather excessive amount of computation time. With the advent of
the Fast Fourier Transform, FFT, computation time is greatly reduced,
allowing this powerful tool to be used in real time signal processing. In
this work special attention has been directed to implementation of the DFT
and FFT in addition to general purpose digital filters.

However, a complete spectral distribution of frequency terms from a
sampled time series may not be required. Sometimes, in real time servo
applications, only one or a limited set of frequency terms may be needed.
Because of its hardware simplicity, the DFT may then be the preferred
method. Another reason is that the DFT can be computed along with data
acquisition, thereby eliminating the need for data storage and avoiding the
latency period required by the FFT. Still another reason is based on the
number of samples to be taken. The number is arbitrary in the DFT scheme;
in the FFT pairing scheme, however, additional complication results unless
the number of samples is a power of two.

In the extensive literature on the FFT, the efficiency of the FFT over the
DFT in terms of the materially reduced number of "operations" is usually
given as the reason for using the FFT algorithm. The number of operations
usually quoted is

$$\text{No. of Operations}_{FFT} = (N/2)\,\log_2(N)$$

$$\text{No. of Operations}_{DFT} = N^2/2$$

where N = no. of data points

Now these expressions in terms of "operations" depend on how the algorithms
are implemented. In this work, using SPAU 2, the kernel or butterfly
operation for the FFT requires 10 steps of 200 ns each. SPAU 1 is slightly
faster, 1920 ns for a full butterfly compared to 2000 ns for SPAU 2.

The "operation" in the DFT, on the other hand, is only 2 steps of 200 ns
each, each being a multiply and simultaneous accumulation, for 400 ns.
Equating these operations as performed by the SPAU 2 hardware gives the
ratio of DFT to FFT time of solution as follows:


$$
\text{Ratio} = \frac{\text{DFT time}}{\text{FFT time}}
= \frac{S_{DFT}\,N^{2}/2}{S_{FFT}\,(N/2)\log_2(N)}
= \left(\frac{S_{DFT}}{S_{FFT}}\right)\frac{N}{\log_2(N)}
$$

where: $S_{DFT}$ = 2 operations at 200 ns each using SPAU 2
                   (for real-only input data)
       $S_{FFT}$ = 10 operations at 200 ns each using SPAU 2

The above formula only considers the case where the input data for the DFT
is real; otherwise, for complex data, $S_{DFT} = S_{FFT}/2$. Many cases
where real sampled data is taken from transducers fit this case. On the
other hand, the FFT is almost always required to compute with complex input
data, even when the system input is only real data.
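
As a worked check of the ratio, the sketch below evaluates it for a few
values of N using the step counts quoted in the text (2 and 10 steps of
200 ns each on SPAU 2, real input data). The function and variable names
are illustrative assumptions.

```python
import math

STEP_NS = 200
S_DFT_NS = 2 * STEP_NS     # one DFT "operation", real input data
S_FFT_NS = 10 * STEP_NS    # one FFT butterfly


def dft_to_fft_time_ratio(n: int) -> float:
    """Ratio of full-spectrum DFT time to FFT time for n data points."""
    dft_time = S_DFT_NS * n * n / 2
    fft_time = S_FFT_NS * (n / 2) * math.log2(n)
    return dft_time / fft_time


ratios = {n: dft_to_fft_time_ratio(n) for n in (8, 16, 32, 64, 1024)}
```

Under these assumptions the ratio is 0.8 at N = 16 and 1.28 at N = 32, so
the crossover falls between those two sizes, and at N = 1024 the FFT is
about twenty times faster.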

Figure 2 shows a graphical comparison of the ratio of the DFT to FFT
solution time. Note that for a small number of points, less than about 32,
the DFT is faster for the implied conditions. For a large number of points
the FFT method is much faster. However, where a large number of points are
collected for solution resolution but a full spectrum solution is not
otherwise required, the breakeven curve shown in Figure 2 gives the number
of DFT output points for equal solution time with the FFT.

Figure 2. Comparison of DFT-FFT Solution Time


DIRECT DFT, D2FT

The DFT for time to frequency domain is shown below. Samples are taken
uniformly at period T.

$$F_K = \sum_{n=0}^{N-1} f_n W^{nK}, \qquad K = 0, 1, 2, \ldots$$

where N = no. of samples
      T = sample period
      $F_K$ = Kth frequency term, at frequency $K/(NT)$
      $f_n$ = nth data sample
      $W^K = e^{-j2\pi K/N} = e^{-j\theta_K} = \cos\theta_K - j\sin\theta_K$,
      rotation coefficient

Direct solution for the Kth frequency term is then

$$
F_K = \sum_{n=0}^{N-1} f_n W^{nK}
= f_0 + f_1(\cos\theta_K - j\sin\theta_K)
+ f_2(\cos 2\theta_K - j\sin 2\theta_K) + \cdots
+ f_{N-1}\left[\cos(N-1)\theta_K - j\sin(N-1)\theta_K\right]
$$


For the case where the samples, $f_n$, are real and two SPAU 2 chips are
used, one for computing the real parts and the second for computing the
imaginary parts, the solution time is 200 ns, T, per step for each
frequency component. This is implemented as shown in Figure 3.


Figure  3.  Direct  DFT 
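
The direct solution can be sketched in software as two running
multiply-accumulate sums, one for the real part and one for the imaginary
part, mirroring the two-chip arrangement described for real input samples.
This is a floating point illustration with assumed names, not a model of
the SPAU 2 arithmetic or its timing.

```python
import math


def direct_dft_term(samples, k):
    """Compute one frequency term F_K of real samples by direct summation."""
    n_samples = len(samples)
    theta_k = 2 * math.pi * k / n_samples
    real_acc = 0.0   # accumulator for the real part
    imag_acc = 0.0   # accumulator for the imaginary part
    for n, f_n in enumerate(samples):
        real_acc += f_n * math.cos(n * theta_k)   # multiply and accumulate
        imag_acc -= f_n * math.sin(n * theta_k)   # minus sign from cos - j sin
    return complex(real_acc, imag_acc)


N = 16
tone = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
term_3 = direct_dft_term(tone, 3)
term_5 = direct_dft_term(tone, 5)
```

Each sample contributes one multiply-accumulate to each accumulator as it
arrives, so a single frequency term can be formed during data acquisition
without storing the samples. For the test tone above, the term at K = 3 has
magnitude N/2 = 8 and the term at K = 5 is zero to within rounding.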

2.3 Fast Fourier Transform, FFT

As noted in the last section, the amount of computation using the D2FT for
large N, where the full spectrum is to be analyzed, is proportional to N^2.
The FFT methods use a sample index sorting scheme called decimation, which
successively divides the transform into smaller DFT's. Carried to the
ultimate on N samples, where N is a power of two, the array is decomposed
into pairs of samples or intermediate products which connect to other pairs.
Each pair forms a computational block. The operation on these pairs by a
rotational coefficient is called the kernel operation, or more frequently a
butterfly, from the resemblance of the flow graph diagram to a butterfly.

For solution of the kernel or butterfly we show the diagram in Figure 4.
This represents the DFT solution of two points which have been paired by
the FFT scheme for time to frequency. The complex FFT points are
represented by:

1st Sample    $X_1 + jY_1$

2nd Sample    $X_2 + jY_2$

These are operated upon by the rotation coefficient functions
$\sin\theta$ and $\cos\theta$. The particular solution we are implementing
is shown below. The primed quantities represent the transformed values.

$$
\begin{aligned}
X_1 + X_2\cos\theta + Y_2\sin\theta &= X_1 + Z_1 = X_1' \\
Y_1 - X_2\sin\theta + Y_2\cos\theta &= Y_1 + Z_2 = Y_1' \\
X_1 - X_2\cos\theta - Y_2\sin\theta &= X_1 - Z_1 = X_2' \\
Y_1 + X_2\sin\theta - Y_2\cos\theta &= Y_1 - Z_2 = Y_2'
\end{aligned}
$$
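
The four butterfly equations translate directly into a short routine. The
sketch below uses floating point and assumed names; it illustrates the
arithmetic only and does not reproduce the SPAU step sequence.

```python
import math


def butterfly(x1, y1, x2, y2, theta):
    """Apply the butterfly to (x1 + j*y1) and (x2 + j*y2).

    Returns the two transformed points as (real, imaginary) pairs.
    """
    c, s = math.cos(theta), math.sin(theta)
    z1 = x2 * c + y2 * s     # real part of the rotated second sample
    z2 = y2 * c - x2 * s     # imaginary part of the rotated second sample
    return (x1 + z1, y1 + z2), (x1 - z1, y1 - z2)


first, second = butterfly(1.0, 2.0, 3.0, 4.0, math.pi / 2)
```

Written this way, the butterfly takes four multiplications and six
additions or subtractions. With theta = pi/2 the second sample 3 + j4 is
rotated to 4 - j3, so the outputs are 5 - j1 and -3 + j5.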

Because of the importance of the Fourier transforms to this implementation
work, the DFT and FFT are taken up in more detail in Section 3.


A cost trade-off study was made on a system which we will call a Kernel
Processor. This system was a major part of an ongoing design at TRW during
the time of the study and represented an MSI application using available
TTL-MSI MIL grade chips. The LSI designs produced under this contract were
taken as the object of study, and approximate comparative costs were
derived.

The  assumptions  used  in  the  study  are  listed  below. 

1)  The  system  life  cycle  is  ten  years. 

2)  Parts  cost  reflected  to  the  customer  are  1.43  times  actual  cost. 

3)  Development  costs  are  2.63  times  actual  costs. 

4)  An  interest  rate  of  8%  is  used  to  calculate  a discount  factor 
of  67%  applied  to  costs  distributed  throughout  the  ten  year 
life  cycle. 

5) An attrition, warehousing, and interest cost of 10% per year
is applied to all spares.

6) A burden factor of 26% is applied to all spares purchases.

7) Mounting and wiring costs of parts are 31 cents per pin based on
1-1/2 terminations per pin, including software costs.
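
The report does not state how the 67% discount factor in assumption 4 was
obtained. One reading that reproduces it, offered here only as an
assumption, is the average present value of equal annual costs spread over
the ten year life cycle at 8% interest:

```python
def average_discount_factor(rate: float, years: int) -> float:
    """Average present value of one unit of cost paid at the end of each year."""
    present_values = [1 / (1 + rate) ** year for year in range(1, years + 1)]
    return sum(present_values) / years


factor = average_discount_factor(0.08, 10)
```

With these inputs the factor comes out at about 0.671, which agrees with
the 67% quoted.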

The LSI chip development costs are taken up first. There is no comparable
cost for the MSI side of the trade-off, since off-the-shelf MSI chips are
used. This item is the greatest penalty the LSI design must overcome for
the custom chip design. On the other hand, once LSI chips with the
capability of the SPAU are available, the development cost would not have
to be supported. We take the conservative approach of assuming three new
custom designs in this study.


LSI DEVELOPMENT COSTS (3 CHIPS)

System Engineering (3 Chips)                   $95,000

   1. Partitioning
   2. Logic Simulation
   3. Timing Analysis
   4. Flowcharts
   5. Specifications

Microelectronic Engineering (3 Chips)          $145,000

   1. Circuit Design
   2. Circuit Simulation
   3. Layout
   4. Verification
   5. Fabrication
   6. Testing

Total LSI Development Costs                    $240,000

Following the LSI chip development, the equipment unit development costs
for the LSI and the MSI versions are derived and computed. The material and
assembly labor completes the acquisition cost for the first development
model built both ways. Costs are listed as both initial and recurring, with
a one-time 25% learning reduction for the second and subsequent systems.
Table 2 shows an itemized listing of these costs.


TABLE 2. KERNEL PROCESSOR

ITEM                                LSI                  MSI
                                    $ INITIAL/RECURRING  $ INITIAL/RECURRING

Unit Dev. Costs, 860 hrs @ $15/hr   33,927/0             66,000/0

Labor                               4340/3255            5915/4436

Integration                         6830/5123            8200/6150

Spares                              8015/6011            9516/7137

Publication Costs                   2000/200             2000/200

Total Initial/Recurring Costs       55,112/14,589        91,631/17,923

Summing up the results, it is seen that the recurring LSI costs are lower
than those of the MSI version: $14,589 against $17,923 per system, so that
the MSI version costs 23% more. While not nearly as spectacular as the view
taken in Section 2.1 (see Figure 1), this is a realistic, conservative
calculation which takes into account a practical system built using LSI and
other available electronics.

If the LSI development costs are to be amortized, then the breakeven
number of systems to be built is 62 units. Thereafter, the savings listed
above would be realized. This calculation has not included the weight and
power supply savings of the LSI. For air and spacecraft electronics, these
could be substantial additional advantages for the LSI.
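
The breakeven figure can be checked from the Table 2 totals. The sketch
below assumes that the first system of each kind carries the initial cost,
that every later system carries the recurring cost, and that the $240,000
LSI development cost is charged once; the variable and function names are
illustrative.

```python
LSI_DEVELOPMENT = 240_000
LSI_INITIAL, LSI_RECURRING = 55_112, 14_589
MSI_INITIAL, MSI_RECURRING = 91_631, 17_923


def cumulative_cost(initial, recurring, units, development=0):
    """Total cost of building `units` systems (units >= 1)."""
    return development + initial + recurring * (units - 1)


def breakeven_units() -> float:
    """Number of systems at which cumulative LSI and MSI costs are equal."""
    fixed_gap = (LSI_DEVELOPMENT + LSI_INITIAL) - MSI_INITIAL
    saving_per_unit = MSI_RECURRING - LSI_RECURRING
    return 1 + fixed_gap / saving_per_unit


units = breakeven_units()
msi_premium_over_lsi = (MSI_RECURRING - LSI_RECURRING) / LSI_RECURRING
lsi_saving_vs_msi = (MSI_RECURRING - LSI_RECURRING) / MSI_RECURRING
```

Under these assumptions the breakeven falls at roughly 62 systems, in line
with the figure quoted. The recurring difference of $3,334 is about 23% of
the LSI recurring cost and about 19% of the MSI recurring cost.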




SECTION  III 

SIGNAL  PROCESSING  ARITHMETIC  UNIT,  SPAU,  LSI 

The most comprehensive LSI chip undertaken in this work was the SPAU. The
first design, the SPAU 1, was more complicated, with more subfunctions and
controls, than the redesigned SPAU, called SPAU 2. Both chips have
advantages; however, in terms of simplicity in programming, performance,
reduced power, and accuracy in carrying out arithmetic operations, the
SPAU 2 is superior and represents an evolutionary design step for this kind
of hardware.

3.1 SPAU 1, Introduction

Much early attention was given to the SPAU 1 architecture, which was
derived by examining existing system applications and then partitioning
these for efficient LSI usage. The systems examined were ones which used
MSI chips and generally performed arithmetic in parallel for high
performance. It was our contention that LSI would be proved or not based on
its general acceptance as replacement hardware for MSI: not one-for-one
replacement, but replacement of MSI hardware at the design concept level.

The most difficult item to satisfy with the LSI is the limited interconnect
compared to the MSI. For parallel arithmetic this generally means
timeshared I/O ports and on-chip, word-wide holding and transfer registers.
Also, in keeping with the idea of making this SPAU as flexible and general
purpose as possible, the control structure was extensive, leading to
complication in programming.

The SPAU 1 design was completed and first hardware fabricated. Tests were
generally satisfactory, except that a few minor problems were found. Two
signal inversions and a high temperature sensitivity, traced to a
saturation condition in the output CML stage, were noted. These were
readily correctable using minor layout changes and remasking. However,
extensive tests carried out on the chip suggested further architecture
changes in the control structure through incorporation of a mask
programmable PLA. The fixed microcode in the original design was not
optimum, and it was envisioned that the mask programmable feature would
make the SPAU more attractive for special tasks. The second layout, which
incorporated these changes, was the subject of extensive work, although it
was later supplanted by the SPAU 2. In the next section the design aspects
of the SPAU 1 are taken up in detail.

3.2 SPAU 1, Design

The SPAU 1 has a readily understandable architecture. It does register
transfer, parallel multiplication with simultaneous addition, and disjoint
addition. It is programmed by means of an on-chip configuration register,
from off-chip microprogram control or from hardware control. The SPAU 1
design handles 12-bit 2's complement parallel words. The simplified block
diagram for this chip is shown in Figure 5. Three 12-bit wide inputs are
provided. Each has a provision for multiplexing to one of two registers.
One 12-bit word output is provided. Registers T, D, and A interface with a
12x12 bit parallel multiplier and simultaneous adder, forming the
product-sum D*T±A. Registers P and S interface with a fast parallel adder,
forming the sum S±P. Registers R and X are loaded with respective inputs
and interface with an internal transfer bus. The internal bus multiplexes
back to register P or to the output tristate off-chip buffers. The sign and
11-bit product MSB's are derived from the multiplier-adder. Provision is
also available, via configuration control, to output the sign and the
11-bit LSB product from the multiplier in a second operation back to
register A and thence to register R, etc.
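
To illustrate how a 12x12 bit product can be delivered as two 12-bit words,
the sketch below splits the full product of two signed 12-bit operands into
a most significant word and a least significant field. The bit alignment
chosen here (a shift by 11 fraction bits, with the low field returned
unsigned) is an assumption made for illustration; the actual SPAU 1
alignment and sign handling are given in the specification of Appendix A.

```python
WORD_BITS = 12
FRACTION_BITS = WORD_BITS - 1   # sign plus 11 bits


def multiply_split(d: int, t: int):
    """Return (msb_word, lsb_field) of the product of two signed 12-bit words.

    Hypothetical helper: msb_word keeps the sign and upper product bits,
    lsb_field keeps the 11 low-order product bits as an unsigned value.
    """
    product = d * t
    msb_word = product >> FRACTION_BITS               # arithmetic shift keeps sign
    lsb_field = product & ((1 << FRACTION_BITS) - 1)  # low 11 bits
    return msb_word, lsb_field


msb, lsb = multiply_split(1024, 1024)       # 0.5 * 0.5 in fractional terms
msb_neg, lsb_neg = multiply_split(-1024, 3)
```

In this format 1024 represents 0.5, and the most significant word of
0.5 times 0.5 is 512, that is 0.25. Recombining the two parts as
(msb_word << 11) + lsb_field recovers the full product, which is what a
second, double precision step would rely on.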

The SPAU operations are sequenced and controlled by means of a clock
decoder, fast (direct) control logic, and a slow (indirect) configuration
register, shown as C in the block diagram. Bringing all controls directly
into the chip might further facilitate utilization of the SPAU for diverse
applications; however, this would have violated the 64 lead maximum imposed
for the LSI.

Figure 5. SPAU Block Diagram

A photograph of the SPAU is shown in Figure 6. This chip has 13,800
devices, dissipates 5.1 watts, and has a chip size of 315 x 351 mils. The
clock period is 120 nsec, and a typical processing time for an instruction
is:

Load P or 1 and Multiply

                            t max     t typ

P register                    60        35
Product driver                25        15
Product propagation          168       140
CML R register setup          15        10
Internal clock skew           10         7
                             ---       ---
                             278       207   (nsec)

Time allocated               360 nsec

Figure 6. Photo of SPAU


The introductory functional block diagram for the SPAU is shown again in
Figure 7. This portrays the control functions by means of switches
connected to their respective locations. The input q is a 4-bit input
signal to the clock decode circuitry. This determines which registers are
to be clocked by the clock input (labeled CLOCK). Input q also has one
additional state, KF, which provides multiplexing control to register A for
LSB retention from the multiplier.

The direct or fast controls are: reset, RSTF; internal bus transfer, RTBT;
add/subtract for the contents of the P register, CADDT; tri-state output
control, BZZT; and configuration register clock, Cl.

The indirect or slow controls, labeled K1 through K5 in the diagram, are
obtained from the contents of the configuration register C. This, in turn,
is loaded from the s input by the Cl direct control. Note that the logical
"0" state can also be loaded independent of the s input. All switches in
Figure 7 are shown in the logical "0" state.

The main features of the SPAU 1 are listed in Table 3. Reviewing this list
in combination with the functional block diagram is self-explanatory.
Complete specifications of the SPAU 1 are given in Appendix A. Given these
features and the control flexibility offered, the principal aim of this
organization for the SPAU has been to allow pipelining of digital filter
algorithms through one or more of these LSI chips, depending upon
performance requirements.

Appendix B is a detailed layout and circuit design discussion for the
SPAU 1. The method of layout, cell placement, signal routing, and schematic
diagrams of all cells are given.

TABLE 3. FEATURES OF SPAU 1


LS-TTL COMPATIBLE INPUT/OUTPUT
TRISTATE TTL OUTPUTS
TTL  INPUTS 

TIME  MULTIPLEXED  INPUT/OUTPUT 
SYNCHRONOUS  SELECTABLE  REGISTER  RESET 
INTERNAL  CONTROL  STORAGE 
INTERNAL  REGISTER  ADDRESSABILITY 
MOST  REGISTERS  EDGE  TRIGGERED  D TYPE 

OPERAND  INPUT  HOLDING  REGISTERS  FOR  ALL  ARITHMETIC  FUNCTIONS 
A RESULT  STORAGE  REGISTER  FOR  EACH  ARITHMETIC  FUNCTION 
ONE  NON-LIMITING  TWO'S  COMPLEMENT  ADDER  WITH  ADD/SUBTRACT  CONTROL 
ONE  LIMITING  TWO'S  COMPLEMENT  ADDER  WITH  ADD/SUBTRACT  CONTROL 
ONE  TWO'S  COMPLEMENT  MULTIPLIER 

TWELVE  (12)  BIT  PRECISION  WITH  23  AVAILABLE  PRODUCT  BITS 

OUTPUT  SCALE  BY  HALF/NO  SCALE 

INTERNAL  ACCUMULATE  FUNCTION  ON  LIMITING  ADDER 

3.3 COMPARATIVE ANALYSIS OF SPAU 1 AND SPAU 2 DESIGNS


Two variations of circuit organization for LSI processing of digital data
have evolved. Both address the general problem of 12-bit parallel word
iterative processing for general purpose digital filtering, and for the FFT
in particular. The SPAU 1 was the earlier concept, while the SPAU 2 evolved
more recently as a simplified architecture. The purpose of this analysis is
to review both designs on a logical block basis and draw some comparisons.
The main equivalence conditions adopted for this analysis are that the
operational processing time is the same for both and that the LSI packaging
constraints, in terms of the number of pins available, are the same; 64 pin
DIP packaging is assumed.

3.3.1  Logical  Block  Diagrams 

SPAU  1 

The logical block diagrams for the SPAU variations are shown in Figure 8
and Figure 9, respectively. Both have four ports for I/O. In the case of the
SPAU 1, three ports (each 12 bits wide) are timeshared inputs feeding six
registers. The two main operations performed are multiplication with
simultaneous addition/subtraction in the "multiplier" block, and disjoint
addition/subtraction in the parallel adder/subtractor block. These two main
operations can be performed simultaneously, with the respective outputs
stored in two additional registers, R and X. The two registers are then
multiplexed to the chip output port on demand, or internally multiplexed to
the P register, which holds one of the operands for the adder/subtractor.
Because of the large number of on-chip registers and controls required, most
of the clocking and controls are indirect: a four bit clock decode block
operates on the contents of register C, which stores the less frequently used
controls (so-called slow controls). Direct fast controls are also available
for clock timing, add/subtract control, internal signal mux, round, scale,
two input register reset/set override controls, and output tristate.

The basic processing rate of the SPAU 1 is one microcycle period of 120 ns
for an add/subtract and two microcycle periods, or 240 ns, for a multiply
with simultaneous addition. The complexity of the control structure is such
as to generally require a pipelined approach to solving an algorithm and
obtaining efficient use of the processing capability.

The SPAU-2 logical block diagram is shown in Figure 9. This is a
considerable simplification over the former. Input ports are direct to the
registers interfacing the 12-bit parallel multiplier. Two output ports are
provided, one for the least significant product/cumulation and one for the
most significant product/cumulation. All controls are direct. The SUB control
either subtracts or adds the contents of the accumulator to the multiplier
product for the next microcycle operation. The ACC control enables the
accumulation or disables it (adds zeros to the product), so that either the
product/cumulation or the product alone is available at the outputs. Note
that ACC and SUB are in the feedback loop, with the result that no clear or
initialization setup is ever required. Four additional bits are available,
making the total cumulation capacity 27 bits. The typical microcycle period
is 175 ns for multiplication/accumulation.

Table  4 shows  the  salient  comparative  features  of  the  two  organizations 
from  a logical  block  point  of  view. 

3.3.2  Comparative  Advantages  and  Disadvantages 

SPAU 

The chief advantage of the SPAU 1 organization is that it allows two
disjoint operations to go on simultaneously, i.e., multiply/sum and summing
operations. In principle, one multiply and three sums can be conducted in
250 ns, for a maximum operation rate of (250/4) = 62.5 ns per operation.

However, this typically cannot be sustained in normal algorithm solving
because of the timeshared terminals and the indirect controls. One particular
type of problem, the FFT, that has been studied at length shows a practical
rate of 192 ns per operation.

The SPAU 1 has hardware limiter and scaling provisions. These are
necessary because of the limited number field available. The control
structure is fairly complicated, and the clock decoder has a limited set, 16,
of control combinations. This control set can be changed by mask programming
to an optimum set for a particular service such as FFT use.

TABLE 4. LOGICAL FEATURES OF SPAU AND SPAU 2 ORGANIZATIONS, SIZE AND POWER

ITEM                      SPAU                  SPAU-2           REMARKS

Input Ports               3, timeshared to      2 direct
                          6 destinations
Output Ports              1, internal MUX       2 direct
                          from 2 sources
Main Operations           D+T*A, 250 ns         X*Y*A, 175 ns    SPAU permits simultaneous
                          S*P, 125 ns                            operations
Product Significance      SGN +22               SGN +22, 175 ns  SPAU timeshares output
                          SGN +11 MSP, 250 ns                    port for MSP and LSP
                          SGN +11 LSP, 500 ns
Accumulator Significance  SGN +11, 125 ns       SGN +26, 175 ns  Much greater significance
                                                                 for SPAU 2
Hardware Limiting         Yes                   No
Scaler                    Yes                   No
Controls                  7 direct              9 direct
                          13 indirect
Packaging                 64 lead DIP           64 lead DIP
Power                     5 watts               2.5 watts        Power reduction, SPAU 2
Device Complexity         15,000 devices        12,000 devices   Reduced complexity, SPAU 2
Chip Size                 315 x 351 mils        256 x 256 mils   Smaller chip size, SPAU 2
                                                                 Smaller package, SPAU 2

SPAU-2

The SPAU-2 organization, in contrast to the other, features direct controls
and simplicity. Our experience is that users consider it preferable, perhaps
for just these reasons.

The operational rate of the SPAU-2 is a multiply and a summation in 175 ns,
or 88 ns per operation. In the next section several digital filter
problem-solving sequences are shown. Using the FFT kernel sequence, 200 ns
(worst case) per operation is obtainable. Thus, it compares favorably with
the SPAU organization as far as speed goes.

The chief advantage of the new organization, other than its simplicity, is
the extended accuracy available from the four additional bits of accumulator
significance. The SPAU 2 allows double precision arithmetic, yet this is
transparent to programming. Also, the LSB's are brought out directly and are
available without additional sequencing time.

3.3.3  Conclusions  on  the  SPAU  Alternatives 

The development effort on the SPAU 1 LSI was well spent. It has led
directly to extensive circuit development as well as a critical study of
alternate architecture concepts and their effectiveness in actual user
applications. This study has identified a simplified SPAU architecture called
SPAU-2. The simpler approach is clearly the more useful LSI architecture. It
has extended accuracy and is more readily understood and accepted by users.
Also, it dissipates less power and has the same operational throughput as the
original SPAU. The programming is materially simplified.

For these reasons, the SPAU-2 design was adopted over the SPAU 1
implementation. The hardware delivery was completed using this more
acceptable design. The -2 chip will directly replace the SPAU in all
applications we have studied. A case example is the DFDU (Digital Filter
Demonstration Unit) described in Appendix C. Here the -2 can replace the SPAU
in the FFT Processor and augment the output processor in a more efficient
way, primarily because the extended significance of the accumulator requires
materially fewer supporting chips.

This conclusion, and indeed the -2 concept itself, is based not only on the
studies carried out under the LSI Implementation contract, but also on a
continued interface with users and potential customers for this type of
hardware over the past two years. Therefore, this conclusion is not lightly
drawn but is believed to be in the best tradition of practical development
experience.

3.4 SPAU 2, Multiplier-Accumulator Design

The Multiplier-Accumulator LSI chip organization is a rather obvious one
for digital signal processing. However, it can rightfully be classed as an
innovation because it has only recently become practical to integrate this
amount of circuit complexity on a single chip. The LSI to VLSI evolution now
makes this a cost effective means for fast hardware solutions to digital
filter problems. The classical methods of carrying out such operations
involve register transfer, multiplication, and summation of products. In
fact, a full parallel word pipelined array processor can have little else but
such LSI if the ultimate in real time performance is required.

This design, the SPAU 2, was spawned by the parallel work on the SPAU 1 and
the on-going work with large parallel multipliers. In connection with
customer surveys on the use of multipliers, it was apparent that the majority
of applications used multiplier-adder/accumulator organizations, but the
hardware was disjoint and separate. We also queried customers, both here at
TRW and in the field, regarding the application of the SPAU 1, but found a
certain reluctance toward the extensive need for control programming. These
thoughts triggered the multiplier-accumulator approach and, as discussed in
the last section, changed our thinking on how to accomplish the best
compromise. We think the SPAU 2 represents this.

In this section, we discuss the design and follow by showing the use of the
LSI chip in typical digital filter applications.


The simplified block diagram for the SPAU 2 is shown in Figure 10. Two
inputs, X and Y, are shown at the top. These are TTL level inputs, each
accepting 12 bit two's complement numbers in parallel. These are clocked into
input registers using separate clock signals.

The input registers interface directly with the 12x12 bit parallel
multiplier. This is an asynchronous logic array which computes continuously,
producing a 24 bit product (sign followed by 23 bits; see I/O number format).
So, as soon as new operands are clocked into the input registers, the
multiplication sequence starts. At the product edges of the multiplication
array an additional set of adders is provided. The 24 bit product is
increased to 27 bits in significance by extending the sign into the higher
order bits. The 27 bit adder then accepts directly the product and a 27 bit
number from the output register. The output from the adder is directly
connected to the 27 bit output register/accumulator. A 27 bit output under
3-state control is buffered to TTL output levels. Two 3-state controls are
provided and explained later in the text. Figure 11 is an expanded block
diagram and Figure 12 gives the I/O specifications for the SPAU-2.

Two controls are provided to allow either a straight multiplication mode or
the multiplication-accumulation mode. The Accumulate (ACC) control in the
high state passes the output signal to the adder, and the contents of the
output register (from the previous operation) are added to the new X, Y
product. When the ACC control is in the low state, logic zeros are fed back
to the adder, disabling the accumulate mode and allowing the product only to
be captured by the output register/accumulator.

The second control, Add/Sub, is only active in the ACC mode. This control
either adds or subtracts the contents of the output register in the next
summation cycle. In keeping with 2's complement arithmetic, when subtracting,
the feedback signals are complemented and a logic 1 is introduced into the
LSB position of the adder.

absolute maximum ratings over operating temperature range

Supply voltage ..................... -0.6 to 7.0 V
Input voltage ...................... 0 to 6.6 V
Output voltage ..................... 0 to 6.6 V
Operating temperature range ........ 0°C to 70°C
Storage temperature range .......... -66°C to 160°C
Lead temperature (10 seconds) ...... 300°C
Junction temperature ............... 176°C

recommended operating conditions

Supply voltage, VCC
Clock pulse width (measured at 1.6 V level)
Input register setup time, ts (see Figure 1)
Input register hold time, tH (see Figure 1)
Operating ambient temperature

electrical characteristics over recommended temperature range

TEST CONDITIONS

switching characteristics, VCC = 5.0 V, TA = 25°C (see Figure 1)

Figure 12. SPAU 2 Specifications

The multiplication/accumulation cycle is terminated when the output
register is clocked. The complete cycle is measured from input clock to
output clock, a period of 175 ns under worst case room temperature
conditions. At 75°C ambient temperature the worst case period is 200 ns for
the SPAU-2. Continuous operation at this 5 MHz rate can be maintained. The
controls ACC and Add/Sub must be held in the same state over the
multiplication/accumulation cycle; consequently, these signals are registered
and clocked with the input register clock signals. Either CLKX or CLKY,
whichever was last, holds the control state. Logically, this is a CLKX OR
CLKY signal.
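
The data path and controls described above can be summarized in a short
behavioral model. The Python sketch below is only an illustration and not a
description of the chip's logic: the class name Mac12, the method step, and
the helper wrap are hypothetical, operands are taken as plain integers in the
12 bit two's complement range (so the product LSB weighs 2^-22 in fractional
notation), and all clocking is collapsed into one call per
multiplication/accumulation cycle.

```python
ACC_BITS = 27   # output register/accumulator width given in the text
IN_BITS = 12    # operand width given in the text


def wrap(value, bits):
    """Reduce an integer to a signed two's complement field of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


class Mac12:
    """Behavioral sketch of one multiply/accumulate cycle (hypothetical names)."""

    def __init__(self):
        self.out = 0  # 27 bit output register/accumulator

    def step(self, x, y, acc=0, sub=0, rnd=0):
        lo, hi = -(1 << (IN_BITS - 1)), (1 << (IN_BITS - 1)) - 1
        if not (lo <= x <= hi and lo <= y <= hi):
            raise ValueError("operands must fit in 12 bit two's complement")
        product = x * y                     # fits in the 24 bit product field
        if rnd:
            product += 1 << 10              # 2**-12 when the LSB weighs 2**-22
        feedback = self.out if acc else 0   # ACC low feeds zeros to the adder
        if acc and sub:
            feedback = -feedback            # complement plus a 1 at the LSB
        self.out = wrap(product + feedback, ACC_BITS)
        return self.out
```

With X*Y = 15 in integer units, four successive calls with ACC low, high,
high, and then high with SUB raised give 15, 30, 45, and 15 - 45 = -30.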

Logic  Diagram 

A simplified logic diagram for the SPAU 2 chip is shown in Figure 13. Only
a 3x3 multiplication array is drawn, to make things easier to visualize;
also, the input registers are not shown. Note that the multiplication array
is approximately a square matrix of adders, actually an n by n+1 array, while
the peripheral adder section is a 2n linear array. Four additional MSB's are
appended to the adder to allow for number expansion while accumulating. As
can be seen from the diagram, the maximum number of stage delays for
multiplication is 2n+2, while for accumulation the stage delays are 2n+3,
where a stage delay is taken as a full adder delay. In the latter case the
extra 3 delays are due to the extended significance given to the
adder/accumulator. Each full adder is implemented by two level gates;
consequently, the average gate delay is less than 3 ns in the SPAU 2.

For  multiplication  and  accumulation  both  operations  are  simultaneous  so  no 
additional  delays  are  encountered.  This  is  the  most  attractive  feature  of  the 
organization  compared  to  disjoint  multiply  then  accumulate  hardware. 

The subtract control complementer in the feedback path is an exclusive OR
gate enabled by the ACC control, shown as an AND gate in the Figure 3.4.4
diagram. The LSB 1 injection is a simple AND gate.

A third input signal not discussed before is the RND control. The RND
control injects a logic 1 into the multiplier array at bit significance
2^-12, using 2's complement fractional notation. Although injected into the
multiplier array, the X and Y operands do not operate on the RND signal; it
always adds 2^-12 to the product. The purpose of this will become apparent
after we discuss the output provisions. The RND control, like the other
controls, must be held stable through the multiplication cycle, so it is
registered and clocked by input clocks CLKX or CLKY, whichever was last.

It is anticipated that a number of users may want to use smaller number
fields than 12 bits for both X and Y inputs. In these cases, the prescribed
number of bits are connected to the lesser significance bit positions and the
sign is extended in the MSB positions. This is easily accomplished by
hardwiring the higher order inputs to the sign. Input buffers are provided on
all inputs, and these exhibit very low external drive source and sink
requirements. Although the input circuits are TTL compatible, the source and
sink current requirements are typically less than 10 µamps. Consequently, no
appreciable increase in load is incurred by hardwiring the sign extended
bits.

Figure 11 illustrates the division of the 27 bit output register into two
parts, a most significant product (MSP) part and a least significant product
(LSP) part. The MSP part is 16 bits long and has a separate 3-state control,
TRIM. The LSP part is 11 bits long and 3-state control TRIL enables its
output (the outputs are enabled when these controls are logic "0"). Many
users may only need the MSP for single precision operation, and a 16 bit wide
data bus is commonly used; consequently, the 3-state controls were selected
this way. The LSP outputs can be hardwired to the same bus if desired. For
that matter, all inputs and outputs can be placed on the same bus and time
shared by use of the 3-state controls and the input register clock controls.
This feature, which is common to TRW's LSI product line, has enjoyed a good
deal of user acceptance for all parallel iterative processing. A small amount
of overhead time, sufficient to sequentially load the two input registers, is
required when operated in this single bus mode.

Returning to the RND control, when in the logic 1 state this adds 2^-12 to
the product/accumulation. Checking with the number field format shown in this
article, it is seen that this rounds off the MSP when using single precision
results and reduces the truncation error. However, this is only true if the
operands used have 12 bit number fields. If this feature is to be preserved
for smaller input number fields, then the inputs should be "left justified"
(sign occupying the chip sign position and zero filling the right hand
remaining positions) instead of using the sign extended connection mentioned
above. A few moments' reflection on the matter should convince you of the
rationale for this.
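
The effect of RND on a single precision result can be checked with a few
lines of arithmetic. The Python sketch below is illustrative only; product,
msp, and left_justify are hypothetical helper names, and operands are
integers whose LSB weighs 2^-11, so that the product LSB weighs 2^-22 and the
16 bit MSP is the register content shifted right by the 11 LSP places.

```python
def product(x, y, rnd=0):
    """Product in units of 2**-22; RND adds 2**-12, i.e. 1 << 10 units."""
    return x * y + ((1 << 10) if rnd else 0)


def msp(value):
    """Upper 16 bits of the 27 bit register: drop the 11 LSP bits."""
    return value >> 11


def left_justify(value, width):
    """Place a `width` bit operand at the sign end of the 12 bit field,
    zero filling the vacated right hand positions."""
    return value << (12 - width)


x, y = 1500, 1500                  # about 0.732 each in fractional notation
exact = x * y / 2048               # 1098.63... in MSP units
truncated = msp(product(x, y))     # RND low: fraction simply dropped
rounded = msp(product(x, y, 1))    # RND high: nearest MSP value

# An 8 bit operand wired with sign extension keeps its numerical value, so
# most of its product lies in the LSP. Left justified, each operand is scaled
# by 2**4 and the product by 2**8, which moves it up into the MSP where the
# RND injection acts.
a8, b8 = 100, 90
sign_extended = msp(product(a8, b8, 1))
justified = msp(product(left_justify(a8, 8), left_justify(b8, 8), 1))
```

For the 12 bit case the truncated MSP is 1098 while the rounded MSP is 1099,
the nearest value to 1098.63. For the 8 bit case the sign extended connection
leaves an MSP of only 4, whereas the left justified connection gives 1125.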

A cell layout diagram of the SPAU 2 is shown in Figure 14. The actual cell
layout has almost one-to-one correspondence with the block diagram. The AX
labeled cells are full adders which form the main multiplier array.
Variations on the full adder cell surround the central core. The
adder/accumulator cells labeled ACL and ACM are shown on the right side and
bottom. Input and output registers interface directly with I/O bonding pads
around the periphery. So, not only in circuit practice but also in actual
device layout practice, this multiplier/accumulator combination works out to
be superbly simple. Additional circuit cells used in the SPAU 2 over those in
the SPAU 1 are shown in Appendix D.

The chip size is 256 x 256 mils (6.5 x 6.5 mm) and the chip has 64 leads,
including five for power and ground. The typical power dissipation is 2.5
watts. With this power and number of leads, a 64 lead DIP package with an
integral heat sink is employed. Operation in still ambient air with the case
temperature at 125°C for the military version of the MAC12 is practical
without any additional thermal crutches. The number of transistors and
resistors implementing this chip is approximately 13,000. A microphotograph
of the SPAU 2 is shown in Figure 15.

Figure 14. Cell Layout SPAU 2

3.5 SPAU 2 Typical Application

In this section several typical applications of the SPAU 2 are examined.
These, of course, represent a small subset of the cases where the chip can be
used to advantage. Most of the examples take cases where repetitive high
speed arithmetic is to be performed using a minimum of off-chip control
signals.

It is interesting to note the new freedom the multiplier/accumulator offers
to digital processing methods. When reviewing the literature on the subject,
one is struck by the fact that algorithm methods invariably follow the
practice of eliminating as many multiplications as possible in preference for
add/subtract. With the SPAU 2, multiplication with simultaneous
addition/subtraction is the built-in feature, representing the best
utilization of hardware for maximum operational throughput. Therefore, we
suspect there is a large body of untapped utility that the
multiplier/accumulator organization offers to the filter algorithm field. The
availability of this first product innovation should provide the spark to
promote more efficient means for its utilization.

In this section we first explore the fundamental operations permissible
with the SPAU 2 and then extend these to several cases of sequences typically
used for digital filtering problems.

A second very important feature is the question of retaining adequate
accuracy against scaling and truncation errors. Error analysis shows that in
most practical filter systems a 12-bit number scheme can be used if close
attention is paid to programming and double precision arithmetic is carried
throughout. The SPAU 2 does all operations using full double precision, yet
this feature is almost transparent to the user. Programming is greatly
simplified.

3.5.1 Use of SPAU 2 in the Multiplier Mode

To emphasize the principal features of the chip, we take as the first case
one where accumulation is not desired and the operation is multiplication
only. This is also the mode used for initialization of the first step in a
sequence of multiplication/accumulation. The format, shown in Table 5, lists
the states of the control signals prior to clocking the input/output
registers. A single system clock signal is assumed, where there is a small
but defined source/transmission delay between the source of the operands and
the clock inputs to the chip. The operand source delay plus the transmission
delay must be greater than 10 ns and less than Tma - ts, where Tma is the
multiplication/accumulation period and ts is the input register set-up time.
This is a conventional type of consideration for D type register elements and
guarantees maximum operating speed.


Figure 16. Setup for Multiplication (inputs: OPERAND X, OPERAND Y)


Figure 16 shows the connections for the SPAU-2 when operating in the
multiplication mode only. Table 5 shows the control states and operating
sequence. Steps 1 through 4 illustrate a typical iterative multiplication
procedure, showing a new product available at the output registers after each
200 ns.

The ACC control is held at logic zero for this mode, which disables the
feedback to the adder. The outputs to the package pins go through the 3-state
TTL buffer under the TRIM and TRIL controls. TRIM is the 3-state control for
the 16 MSP outputs and TRIL is the 3-state control for the 11 LSP outputs.
The logic zero state for TRIM and TRIL enables the outputs to the low
impedance TTL level states. If one were to time share the outputs on the same
bus, then TRIM and TRIL would be used to first read one portion of the
product out to the bus and then the other.

The sequence in Table 5 under steps 5 through 7 shows the result of using
the RND control in step 5. As shown in step 6, this adds the quantity 2^-12
(two's complement fractional notation) to the product. Typically, RND would
only be used if one were outputting the MSP and wanted to "round" and thereby
reduce the error introduced by truncation.

One of the basic operations, transferring a number from the input of the
SPAU 2 to the output, is the special case of multiplication where the
coefficient is unity. This is illustrated in Table 5 under steps 8 through
13. Scaling and negation are similar operations which use an appropriate
coefficient for the multiplier. The notation we have adopted for the two's
complement numbers is to treat the number field as fractional, with the
binary point located immediately after the sign bit (for numbers input to the
multiplier). With this notation a true +1 unity value is not quite available
and the closest quantity to it is (1 - 2^-11). On the other hand, a true -1
of unity absolute value is available, and this will negate the input variable
exactly and transfer it to the output as shown in either step 12 or 13. Those
of you who are not 2's complement buffs may want to review the format notes
shown at the end of this section. For those of you who are, we show step 14,
where (-1)*(-1) = +1, which is permissible in the SPAU 2 because extended
significance is provided by additional adders in the output circuits.


Once an operand is loaded into the input register by the respective input
clock, the value will be held until changed. This is illustrated in steps 11
through 14. This concludes the multiplier features; we examine next the
accumulator mode.
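
The fractional number format used in this discussion can be verified
numerically. The following Python sketch is an illustration under stated
assumptions: to_fraction and the constant names are hypothetical, input codes
are 12 bit integers with 11 fractional bits, and product codes carry 22
fractional bits.

```python
from fractions import Fraction


def to_fraction(code, frac_bits=11):
    """Interpret an integer code as a two's complement fraction."""
    return Fraction(code, 1 << frac_bits)


MOST_POSITIVE = 2047    # 0111 1111 1111
MINUS_ONE = -2048       # 1000 0000 0000
HALF = 1024             # 0100 0000 0000, i.e. +0.5

# The closest available value to +1 is 1 - 2**-11.
largest = to_fraction(MOST_POSITIVE)

# Multiplying by it is therefore not an exact transfer of the input.
near_unity_transfer = to_fraction(MOST_POSITIVE * HALF, 22)

# Multiplying by -1 negates the input exactly.
negated = to_fraction(MINUS_ONE * HALF, 22)

# (-1) * (-1) = +1 exceeds the 12 bit input range but fits easily in the
# 27 bit accumulator.
square = to_fraction(MINUS_ONE * MINUS_ONE, 22)
fits_in_27_bits = -(1 << 26) <= MINUS_ONE * MINUS_ONE < (1 << 26)
```

Here largest is 2047/2048, near_unity_transfer is 2047/4096 rather than 1/2,
negated is exactly -1/2, and square is exactly +1.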

3.5.2  Use  of  SPAU  2 In  the  Accumulator  Mode 

Table 6 illustrates the multiplier/accumulator mode for the SPAU 2. We
take as a first case a simple counter where the modulus is (X1*Y1). Note in
line 1 that the only initialization procedure is to place the ACC control at
logic 0, and at Tma time later (200 ns) the first count value (X1*Y1) is
available at the output register. At step 2, ACC = 1 is loaded into its
holding register. Successive accumulation of the counting modulus is shown in
steps 3 and 4.

The effect of the SUB control is shown in steps 4 and 5. SUB negates the
number fed back to the adder. Consequently, the numbers added are
(X1*Y1) - 3(X1*Y1) = -2(X1*Y1). Positive accumulation is then restored at
step 5 and the count proceeds in ascending order to zero at step 7.

In Table 6 the dash marks are a "don't care" state. The controls ACC,
SUB, and RND are registered controls loaded by CLKX or CLKY. TRIM and TRIL
do not affect the internal operation.

Steps 8 through 13 show the solution of an arbitrary function:

F(X,Y,Z,K) = X^2 - Y^2 + Z^2 + KZ - K. This would normally take 4 additions
and 4 multiplications, whereas the SPAU 2 does this in 5 steps of 200 ns
each, thus computing the function in one microsecond and yielding a full
double precision result.
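
One way to arrange the five steps is sketched below in Python. The ordering
shown is an assumption chosen to suit the SUB control, which negates the fed
back value; it is not copied from Table 6. The helper mac and the function
evaluate_f are hypothetical names, operands are integers with 11 fractional
bits, and the 27 bit wrap-around of the accumulator is left out because the
example values stay well inside that range.

```python
MINUS_ONE = -2048   # the exact -1 in the 12 bit fractional format


def mac(out, x, y, acc=0, sub=0):
    """One multiply/accumulate step: x*y plus, minus, or without feedback."""
    feedback = out if acc else 0
    if acc and sub:
        feedback = -feedback
    return x * y + feedback


def evaluate_f(x, y, z, k):
    """F = X^2 - Y^2 + Z^2 + K*Z - K, result in units of 2**-22."""
    out = mac(0, y, y)                      # step 1: Y^2 (ACC low)
    out = mac(out, x, x, acc=1, sub=1)      # step 2: X^2 - Y^2
    out = mac(out, z, z, acc=1)             # step 3: ... + Z^2
    out = mac(out, k, z, acc=1)             # step 4: ... + K*Z
    out = mac(out, k, MINUS_ONE, acc=1)     # step 5: ... - K (K times -1)
    return out


STEP_NS = 200
total_ns = 5 * STEP_NS   # one microsecond, as stated in the text
```

For X = 1000, Y = 500, Z = -300, K = 200 (integer codes) the result is
370400 in units of 2^-22, which agrees with evaluating the expression
directly on the corresponding fractions.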
3.5.3 Example for Numerical Integration


Based on the area under a parabolic arc, Simpson's rule is an accepted
method for numerical integration of a sampled data sequence.

Let a function y(t) be sampled at n + 1 points, such that y_0, y_1, ..., y_n
are equally spaced at an incremental interval T. Assume that n is even. Then
according to Simpson's rule (see almost any calculus textbook), the area A_c
under the curve y(t), given by

    A_c = \int_{t_0}^{t_0 + nT} y(t)\,dt,                                      (1)

may be approximated by

    A_S = \frac{T}{3}(y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + \cdots + 4y_{n-1} + y_n).   (2)

This is generally more accurate than the so-called trapezoidal rule

    A_T = \frac{T}{2}(y_0 + 2y_1 + 2y_2 + \cdots + 2y_{n-1} + y_n),            (3)

which approximates the function y(t) by straight-line segments and therefore
fails to take account of curvature.

An accumulation of the terms in Equation (2), therefore, implements
Simpson's rule explicitly, where it is necessary only to input the sequence
of sampled points and the appropriate sequence of weighting coefficients.
After any step m, where m < n, the contents of the accumulator are A_m, which
is an approximation to the running integral up to that point. When m = n and
the accumulation is terminated with the proper weighting coefficient (see
Note), the evaluation is complete and A_n = A_S.
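
The accumulation of Equation (2) amounts to one multiply/accumulate step per
sample. The Python sketch below shows the idea under simplifying assumptions:
floating point values stand in for the chip's 12 bit fixed point operands, and
the function names simpson_weights and integrate are hypothetical.

```python
def simpson_weights(n, T):
    """Weights T/3, 4T/3, 2T/3, ..., 4T/3, T/3 for n + 1 samples (n even)."""
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    weights = []
    for m in range(n + 1):
        if m == 0 or m == n:
            weights.append(T / 3)        # end points
        elif m % 2:
            weights.append(4 * T / 3)    # odd-numbered samples
        else:
            weights.append(2 * T / 3)    # even-numbered interior samples
    return weights


def integrate(samples, T):
    """Accumulate sample * weight products, one step per sample."""
    n = len(samples) - 1
    acc = 0.0                            # first step: product only (ACC low)
    for y_m, w_m in zip(samples, simpson_weights(n, T)):
        acc += y_m * w_m                 # later steps: ACC high
    return acc


# y(t) = t^2 sampled at t = 0, 0.5, 1, 1.5, 2; the exact area is 8/3.
area = integrate([t * t for t in (0.0, 0.5, 1.0, 1.5, 2.0)], 0.5)
```

Simpson's rule is exact for polynomials up to third degree, so area agrees
with 8/3 to within floating point rounding.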
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Figure  17.  Numerical  Integration 


TABLE 7. SEQUENCE FOR NUMERICAL INTEGRATION

INPUT           CONTROLS                                              OUTPUT REGISTER
X     Y         ACC  SUB  RND  CLK X  CLK Y  CLK P  TRIM  TRIL

Y0    T/3       0    0    0    pulse  pulse  0      0     0
Y1    4T/3      1    0    0    pulse  pulse  pulse  0     0     Y0 T/3
Y2    2T/3      1    0    0    pulse  pulse  pulse  0     0     Y0 T/3 + 4 Y1 T/3
Y3    4T/3      1    0    0    pulse  pulse  pulse  0     0     Y0 T/3 + 4 Y1 T/3 + 2 Y2 T/3
...
Yn    T/3       1    0    0    pulse  pulse  pulse  0     0     ...
-     -         0    0    0    -      -      pulse  0     0     A = Σ (Ti Yi)

NOTE: To avoid termination error, based on Simpson's rule outlined above, the
integration should terminate on an odd number of samples (n even) with a
weight of T/3, as shown. If it is necessary to terminate on an even number of
samples (n odd), then it is a good approximation to keep the sequence up to
that point and terminate with a weight of 2T/3.

3.5.4 Example for Complex Multiplication


Given  two  complex  numbers  Zj  and  Z2  and  multiplying  these  we  have 
Zi-  Xl+JYl 

Z2»  X2+jY2 

ZX*Z2=  Xi*X2-Y1*Y2+j(Xi*Y2+X2*Yj ) 

= X + jY 

A diagram  for  this  operation  Is  shown  in  Figure  18. 


[Figure 18 diagram: the products X1*X2, Y1*Y2, X1*Y2, and X2*Y1 are combined into the real product X = X1*X2 - Y1*Y2 and the imaginary product Y = X1*Y2 + X2*Y1.]


Figure 18. Complex Multiplication


TABLE 8. SEQUENCE USING ONE SPAU 2

INPUT      CONTROLS                                              OUTPUT TERMINALS
X    Y     ACC  SUB  RND  CLK X  CLK Y  CLK P  TRIM  TRIL

[Four-step sequence. The output terminals deliver X = X1*X2 - Y1*Y2, followed by Y = X1*Y2 + X2*Y1.]


To compute this using disjoint adder-multiplier means requires 4 multiplications and 2 additions, for 6 operations. Here we see that one SPAU 2 computes the complex product in four steps, in 800 ns. It is clear that using two chips, one operating on the reals and the other on the imaginaries, would compute it in 400 ns.
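
The four-step sequence can be sketched with a toy multiplier-accumulator model. The `Mac` class and its `step` method are hypothetical names. The model assumes that with ACC and SUB both set the previous register contents are subtracted from the new product, which is the behavior the butterfly sequence in Table 10 implies; the order in which the products are formed is likewise an assumption.

```python
class Mac:
    """Toy multiplier-accumulator: one product per step, optional accumulate."""

    def __init__(self):
        self.p = 0  # output (product/accumulator) register

    def step(self, x, y, acc=False, sub=False):
        product = x * y
        if acc:
            # Assumed SUB behavior: product minus previous register contents.
            self.p = product - self.p if sub else product + self.p
        else:
            self.p = product
        return self.p


def complex_multiply(x1, y1, x2, y2):
    """Form (x1 + j*y1) * (x2 + j*y2) in four multiply-accumulate steps."""
    mac = Mac()
    mac.step(y1, y2)                            # step 1: Y1*Y2
    real = mac.step(x1, x2, acc=True, sub=True)  # step 2: X1*X2 - Y1*Y2
    mac.step(x1, y2)                            # step 3: X1*Y2
    imag = mac.step(x2, y1, acc=True)           # step 4: X2*Y1 + X1*Y2
    return real, imag
```

At 200 ns per step, the four calls to `step` correspond to the 800 ns quoted above.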


3.5.5 Example for Discrete Fourier Transform Solution

Solution of the DFT was shown in Section 2.2. The connections and tabular listing of the controls are quite simple with the SPAU 2. These are shown in Table 9 and the diagram of Figure 19.


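[Figure 19 diagram: the input samples, n = 0, 1, 2, ..., N-1, are multiplied by COS nθk, n = 0, 1, 2, ..., N-1, and accumulated; SIN nθk is used for the imaginary solution.]


Figure 19. Reals Solution Only (Imaginary Solution Is Similar), DFT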


TABLE 9. DFT SEQUENCE

INPUT                   CONTROLS                                              OUTPUT REGISTER                                NS
X        Y              ACC  SUB  RND  CLK X  CLK Y  CLK P  TRIM  TRIL

f0       1              0    0    0    pulse  pulse  pulse  1     1     -                                              0
f1       COS θk         1    0    0    pulse  pulse  pulse  1     1     f0                                             200
f2       COS 2θk        1    0    0    pulse  pulse  pulse  1     1     f0 + f1 COS θk                                 400
...
f(N-1)   COS (N-1)θk    1    0    0    pulse  pulse  pulse  1     1     f0 + f1 COS θk + ... + f(N-2) COS (N-2)θk
-        -              -    -    -    0      0      pulse  0     0     f0 + f1 COS θk + ... + f(N-1) COS (N-1)θk      N * 200

TRANSFER TO DESTINATION
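
The accumulation in Table 9 amounts to one multiply-accumulate per input sample. A minimal sketch follows, assuming θk = 2πk/N (the definition from Section 2.2 is not repeated here) and using floating point in place of the chip's fixed-point arithmetic; `dft_real` is a hypothetical name.

```python
import math


def dft_real(samples, k):
    """Real part of DFT bin k, formed as a running sum of f(n) * cos(n * theta_k)."""
    n_points = len(samples)
    theta_k = 2 * math.pi * k / n_points  # assumed definition of the angle step
    acc = 0.0
    for n, f in enumerate(samples):
        acc += f * math.cos(n * theta_k)  # one multiply-accumulate step
    return acc
```

The imaginary part follows the same loop with `math.sin` in place of `math.cos`.
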

The arithmetic operation discussed in Section 2.3, solution of the complex arithmetic in connection with FFT problems, finds a ready answer using the SPAU 2.


For solution of the kernel or butterfly we show the diagram in Figure 20. This represents the DFT solution of two points which have been paired by the FFT scheme for time to frequency. The complex FFT points are represented by:

1st Sample  X1 + jY1

2nd Sample  X2 + jY2

These are operated upon by the rotation coefficient functions sin θ and cos θ. The particular solution we are implementing is shown below. The primed quantities represent the transformed values.


X1 + X2 cos θ + Y2 sin θ = X1 + Z1 = X1'
Y1 - X2 sin θ + Y2 cos θ = Y1 + Z2 = Y1'

Figure  20.  Diagram  for  Solution  of  Butterfly 


TABLE 10. SEQUENCE FOR KERNEL SOLUTION USING ONE SPAU 2

INPUT          CONTROLS                                              CONTENTS OF OUTPUT REGISTER
X     Y        ACC  SUB  RND  CLK X  CLK Y  CLK P  TRIM  TRIL

X2    COS θ    0    0    0    pulse  pulse  pulse  0     0     -
Y2    SIN θ    1    0    0    pulse  pulse  pulse  0     0     X2 COS θ
X1    1        1    0    0    pulse  pulse  pulse  0     0     X2 COS θ + Y2 SIN θ
X1    1        1    1    0    pulse  pulse  pulse  0     0     X1 + X2 COS θ + Y2 SIN θ = X1'
X1    1        1    0    0    pulse  pulse  pulse  0     0     -Z1
X2    SIN θ    0    0    0    pulse  pulse  pulse  0     0     X1 - Z1 = X2'
Y2    COS θ    1    1    0    pulse  pulse  pulse  0     0     X2 SIN θ
Y1    1        1    0    0    pulse  pulse  pulse  0     0     -X2 SIN θ + Y2 COS θ
Y1    1        1    1    0    pulse  pulse  pulse  0     0     Y1 - X2 SIN θ + Y2 COS θ = Y1'
Y1    1        1    0    0    pulse  pulse  pulse  0     0     -Z2
-     -        -    -    -    0      0      pulse  0     0     Y1 - Z2 = Y2'

For this implementation, 10 steps of 200 ns each are required, for a total period of 2 µs. The sequence can be arranged to use -1 as the multiplier. The twelve-bit 2's complement fractional number field includes 1 - 2^-11 but not 1. Alternative schemes which use a partial integer field also avoid error, but would have to accept a loss in accuracy due to representing a quantity such as cos θ with fewer significant digits.


For the case where two SPAU 2 were to be used, it is apparent that only five steps are required, one computing X1' and X2' while the other computes Y1' and Y2'.
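
The ten-step butterfly sequence can be checked with a small software model. This sketch rests on assumptions: `Mac` is a hypothetical toy accumulator in which ACC with SUB forms (new product minus previous register contents), floating point replaces the twelve-bit fractional arithmetic, and a multiplier of +1 is used for clarity even though the hardware number field would call for the -1 arrangement mentioned above.

```python
import math


class Mac:
    """Toy multiplier-accumulator with a single output register."""

    def __init__(self):
        self.p = 0.0

    def step(self, x, y, acc=False, sub=False):
        product = x * y
        if acc:
            # Assumed SUB behavior: product minus previous register contents.
            self.p = product - self.p if sub else product + self.p
        else:
            self.p = product
        return self.p


def butterfly(x1, y1, x2, y2, theta):
    """Return (X1', Y1', X2', Y2') using ten multiply-accumulate steps."""
    c, s = math.cos(theta), math.sin(theta)
    mac = Mac()
    mac.step(x2, c)                          # X2 cos
    mac.step(y2, s, acc=True)                # Z1 = X2 cos + Y2 sin
    x1p = mac.step(x1, 1.0, acc=True)        # X1' = X1 + Z1
    mac.step(x1, 1.0, acc=True, sub=True)    # X1 - X1' = -Z1
    x2p = mac.step(x1, 1.0, acc=True)        # X2' = X1 - Z1
    mac.step(x2, s)                          # X2 sin
    mac.step(y2, c, acc=True, sub=True)      # Z2 = Y2 cos - X2 sin
    y1p = mac.step(y1, 1.0, acc=True)        # Y1' = Y1 + Z2
    mac.step(y1, 1.0, acc=True, sub=True)    # Y1 - Y1' = -Z2
    y2p = mac.step(y1, 1.0, acc=True)        # Y2' = Y1 - Z2
    return x1p, y1p, x2p, y2p
```

With two such accumulators, the first five steps and the last five steps are independent and could run side by side, which is the five-step case described above.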

3.5.7 Non-Recursive Filter Implementation

The DFT and the butterfly are examples of non-recursive filters, since the outputs do not operate on the inputs in a feedback mode. More typical cases of non-recursive filter implementation with SPAU 2 are discussed in this section.

The non-recursive filter is shown in Figure 21. This has the transfer function H(z):

H(z) = h0 + h1 z^-1 + h2 z^-2 + h3 z^-3

The h's are weighting coefficients applied to delayed sample data, where z^-1 represents one sample period delay, z^-2 represents two sample period delays, etc. The zeros of this transfer function can be found by setting H(z) = 0 and solving the polynomial for z. Filters of this type can be used for smoothing noisy data, windowing the DFT, and as a part of more complex filters. A block diagram is shown in Figure 21.

Figure 21. Non-Recursive Filter

To implement the filter using one SPAU 2, the input data Xt is held at one of the inputs for the duration of the filter delay, four periods for the Figure 21 case. The products formed with the successive weighting coefficients are accumulated, and the sum is then stored in the output register. The sequence is shown in Table 11.


TABLE  11.  SEQUENCE  FOR  SOLUTION  OF  NON-RECURSIVE  FILTER 

Where higher speed is required, four SPAU 2 operating in a parallel mode with sequential multiplexing to a common output bus can be used. This is shown in Figure 22.


Figure 22. Non-recursive Filter, Parallel Mode


An external four-stage circulating shift register holds the weighting functions, h. A tag bit is also circulated in the h shift registers, which operates the 3-state control, thereby busing the accumulator contents to the output in sequence. Input register clocking, output register clocking, and the h shift register are all operated directly from the 5 MHz system clock. The control ACC is also operated from the tag bit, as is the 3-state control. As can be traced from the block diagram, each SPAU 2 accumulates four products and then is gated to the output bus. On the next clock period the adjacent SPAU 2 is output, etc. By these means, steady outputs at the 5 MHz rate are sustained. The latency period is four clock periods, or 800 ns. The internal operation sequence for any one SPAU 2 is the same as shown in the previous implementation.
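
The four-tap non-recursive filter discussed in this section reduces to four multiply-accumulate steps per output sample. The sketch below is a behavioral model only: `fir_mac` is a hypothetical name, the delay line is an ordinary Python list, and no attempt is made to model the tag bit or the bus multiplexing of the parallel arrangement.

```python
def fir_mac(samples, h):
    """Non-recursive filter: each output is an accumulated sum of len(h) products."""
    delay = [0.0] * len(h)          # delayed samples, newest first
    outputs = []
    for x in samples:
        delay = [x] + delay[:-1]    # shift the new sample into the delay line
        acc = 0.0
        for coeff, delayed in zip(h, delay):
            acc += coeff * delayed  # one multiply-accumulate step per tap
        outputs.append(acc)
    return outputs
```

Feeding a unit impulse returns the weighting coefficients themselves, which is a convenient check of the tap ordering.
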

3.5.8 Recursive Filter Implementation

The usual block diagram for the simplest recursive digital filter is shown in Figure 23.


Figure 23. Simple Recursive Filter

The MAC organization has internal feedback for product accumulation, but it is not provided with the output-to-multiplier-input signal connections which the recursive filter requires. However, externally the output can be bused to the input as shown in Figure 24. The 3-state output controls and the input register clock controls effect time sharing of the inputs and outputs.

Figure 24. Single SPAU 2 Part Implementation of Single Pole Filter

In this case, two-step operation is used, where the input into the multiplier is alternated between 1 and the weight coefficient h, and the output is read out on alternate clock cycles. Operation is shown below, assuming initial output accumulator contents of value V. A full multiply period is used to allow transfer from the output to the input. This assumes a single-phase clock system is used. More efficient utilization for this type of operation can be realized using a multiphase clock system; however, for simplicity in displaying the sequence, and in common with the other sequences presented, the single clock method is assumed for the operation shown in Table 12.

TABLE 12. RECURSIVE FILTER OPERATION USING SPAU 2

For a filter with two or more poles we have the diagram shown in Figure 25.

Figure 25. Two or More Pole Recursive Filters

Using the methods outlined, multipole filters as well as pole and zero combinations can be implemented using the multiplier-accumulator organization.
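
The two-step operation of the single pole filter can be sketched as follows. The difference equation y(n) = x(n) + h*y(n-1) is an assumption, since Figures 23 and 24 are not reproduced here; `single_pole` and `v0` are hypothetical names, with `v0` standing for the initial accumulator contents.

```python
def single_pole(samples, h, v0=0.0):
    """Single pole recursive filter using two multiply-accumulate steps per sample."""
    y = v0                 # initial accumulator contents
    outputs = []
    for x in samples:
        acc = h * y        # step 1: previous output bused back, multiplied by h
        acc += 1.0 * x     # step 2: new input multiplied by 1 and accumulated
        y = acc
        outputs.append(y)  # output is read on alternate clock cycles
    return outputs
```
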

NOTE 1. When nonaccumulating, all four MSB will indicate the sign of the product. The PR-0 term will also indicate the sign except for the one exceptional case when multiplying -1 • -1. Note that, with the additional significant bits available on this multiplier, -1 • -1 is a valid operation yielding a +1 product.

NOTE  2.  There  is  no  change  in  the  format  whether  one  is  accumulating  the  sum  of  products  or  simply  doing  single 
products.  However,  the  three  additional  most  significant  bits  are  provided  to  allow  valid  summation 
beyond  that  available  for  a single  multiplication  product.  For  further  clarification,  no  difference  exists 
between  this  organization  and  one  which  would  have  the  product  accumulation  off  chip  in  a separate 
27-bit  wide  adder.  Taking  the  sign  at  the  most  significant  bit  position  guarantees  that  the  largest  number 
field  will  be  used.  In  operation  the  sign  will  be  extended  into  the  lesser  significant  bit  positions  when  the 
accumulated  sum  only  occupies  a right  hand  portion  of  the  accumulator.  As  an  example,  when  the  sum 
only  occupies  the  least  three  bit  positions  then  the  sign  will  be  extended  through  the  24  most  significant 
positions. 

The  latter  factor  allows  one  to  detect  imminent  overflow/underflow  should  this  be  desired.  Using  an  off 
chip  exclusive  OR  gate  connected  to  the  sign  and  the  next  most  significant  bit  will  flag  imminent  overflow/ 
underflow.  When  the  two  inputs  are  different,  the  exclusive  OR  gate  goes  to  a logic  one  state.  In  this  case 
four  more  multiply  accumulate  cycles  would  be  allowable  without  overflow/underflow,  but  a fifth  could 
possibly  cause  overflow/underflow  depending  upon  the  magnitude  of  the  sum  steps. 

NOTE 3. Format is shown using a 2's complement fractional notation. In this notation the location of the binary point signifying separation of the integer and fractional fields is just after the sign, between the sign and the next most significant bit for the multiplier inputs. This scheme is carried over to the output format, except that an extended significance to the integer field is provided (to extend the utility of the accumulator). Consistent with the input notation, the output binary point is located between the PR-0 and PR-1 bit positions (for the nonaccumulate mode). For the accumulate mode the binary point position is the same, between the S-0 and S-1 bit positions.

It is arbitrary where the binary point is considered located as long as one is consistent with both input and output formats. One can consider the number field entirely integer, i.e., with the binary point just to the right of the least significant bit for input, product, and accumulated sum.
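
The imminent overflow/underflow flag described in Note 2 can be modeled in a few lines. This is a sketch under stated assumptions: the accumulator is treated as a 27-bit 2's complement integer, following the 27-bit adder width mentioned in the note, and the function names are hypothetical.

```python
ACC_BITS = 27  # accumulator width assumed from the 27-bit adder in Note 2


def to_twos_complement(value, bits=ACC_BITS):
    """Encode a signed integer as an unsigned 2's complement word."""
    return value & ((1 << bits) - 1)


def imminent_overflow(word, bits=ACC_BITS):
    """Exclusive OR of the sign bit and the next most significant bit."""
    sign = (word >> (bits - 1)) & 1
    next_msb = (word >> (bits - 2)) & 1
    return bool(sign ^ next_msb)
```

While the sum occupies only the lower part of the accumulator, sign extension keeps the two bits equal and the flag stays low; it rises once the magnitude reaches the bit just below the sign.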


3.5.10  Conclusions  on  Applications  of  SPAU  2 

These sections showed two different approaches to solutions of arithmetic algorithms, using SPAU 1 and SPAU 2. It is evident that SPAU 2 is generally superior and manifestly simpler to program. For these reasons, the design during the latter period of this contract concentrated on the SPAU 2 and made debugging iterations on the layout to resolve all problems and meet specifications. A complete commercial specification is the subject of Appendix C. This specification is listed under TRW house number TDC 1003J.

Packaging  of  the  SPAU  2 changed  from  the  64  lead  flat  pack  used  on  SPAU  1 
to  a heat  studded  64  lead  DIP.  This  package  arrangement  allows  the  SPAU  2 to 
operate  in  still  air  at  a case  temperature  of  125°C. 

Processing yield measured on this 256 by 256 mil chip is approximately 40% at wafer probe test. This corresponds to an approximate processing defect density, D0, of 2.5 defects/cm2. We feel this most complex chip of the set has demonstrated all original goals and was brought to a complete manufacturing state.
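
The relation between probe yield and defect density can be illustrated numerically. The report does not state which yield model was used, so the two textbook models below (Poisson and Murphy) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math

MIL_TO_CM = 2.54e-3  # one mil is 0.001 inch


def chip_area_cm2(width_mils, height_mils):
    """Chip area in square centimeters from dimensions in mils."""
    return (width_mils * MIL_TO_CM) * (height_mils * MIL_TO_CM)


def poisson_defect_density(yield_fraction, area_cm2):
    """Invert the Poisson model Y = exp(-A * D0) for D0."""
    return -math.log(yield_fraction) / area_cm2


def murphy_yield(defect_density, area_cm2):
    """Murphy model Y = ((1 - exp(-A * D0)) / (A * D0)) ** 2."""
    ad = defect_density * area_cm2
    return ((1 - math.exp(-ad)) / ad) ** 2


def murphy_defect_density(yield_fraction, area_cm2, lo=1e-6, hi=100.0):
    """Invert the Murphy model by bisection (yield falls as D0 rises)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if murphy_yield(mid, area_cm2) > yield_fraction:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For a 256 by 256 mil chip at 40% yield, these two models give roughly 2.2 and 2.4 defects/cm2, the same order as the 2.5 defects/cm2 quoted above.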

The industry should profit from the organizational structure of the SPAU 2. We feel it serves as a model for innovative high speed digital filter applications. One lesson it teaches that goes beyond the hardware accomplishment is that multiplication and accumulation can be considered simultaneous operations rather than disjoint as normally practiced. Also, multiplication and addition are equally weighted operations.

This  means  that  multiplications  do  not  have  to  be  minimized  in  preference 
to  addition  as  commonly  considered  standard  practice  in  the  field.  The  hope  is 
that  this  new  freedom  will  allow  designers  to  invent  more  efficient  algorithms. 
Only  time  and  full  exploitation  of  the  SPAU  2 concepts  will  tell. 

SECTION IV

SIGNAL PROCESSING DELAY LINE, SPDL, DESIGN

The Signal Processing Delay Line (SPDL) is a grouping of shift registers primarily designed to facilitate address control of an FFT pipeline processor. It also has general purpose application. The logic diagram in Figure 26 shows the organization of this chip. The upper half is a six-bit wide and five-bit long serial shift register with provision for multiplexing to the output either the third or fifth signal. Input multiplexing of either A or B input signals is also provided. The lower half is the same except for output multiplexing of the first and fifth bit signals. All inputs and outputs are TTL compatible and the outputs have tri-state control.

The shift registers are all D type and edge triggered on the leading clock edge. The inputs are level shifted from TTL to CML levels and the registers are implemented using CML circuits. The output circuit includes output multiplexing at CML levels and level shifting back down to TTL.
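
A behavioral sketch of one half of the SPDL may help in following the description above. It models only the data path (input select, five-stage shift, output tap select); tri-state control and logic levels are ignored, and the class and parameter names are hypothetical.

```python
class DelayLineHalf:
    """One half of the delay line: 6 bits wide, 5 stages long, two output taps."""

    def __init__(self, early_tap, width=6, length=5):
        self.mask = (1 << width) - 1
        self.stages = [0] * length   # stages[0] is the first stage
        self.early_tap = early_tap   # 3 for the upper half, 1 for the lower half

    def clock(self, a, b, select_b=False):
        """Shift in the selected input word on the leading clock edge."""
        word = (b if select_b else a) & self.mask
        self.stages = [word] + self.stages[:-1]

    def output(self, select_late=True):
        """Return the fifth-stage word, or the early tap when select_late is False."""
        tap = len(self.stages) if select_late else self.early_tap
        return self.stages[tap - 1]
```

Instantiating `DelayLineHalf(3)` and `DelayLineHalf(1)` gives the upper and lower halves respectively.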

4.1  SPDL  Layout  and  Circuit  Cells 

The layout design of the SPDL followed the practice used for the SPAU 1 and 2. This circuit is a very "regular" logic function and would offer no problems to a full CAD approach to the layout. However, for expediency, since the cell family using CML logic is a relatively new usage for the 3D technology and had not received the formal attention normally required for standard cells, it was laid out using the custom cell method with interactive graphics aids.

The layout plan is shown in Figure 27. This chip is packaged in a 40-lead ceramic DIP and the pin locations are noted inside of circles on the figure. The overall chip size is under 150 x 150 mils and contains 2322 devices. The circuit cells are shown in Figures 28 through 31.

4.2 SPDL Specifications

The complete specifications for the SPDL are given in Appendix E.

SECTION V
SPAC-1 DESIGN

The Signal Processing Address Control (SPAC) is the third type of chip developed under this program. This is a special purpose design to generate address control logic for FFT applications.

Note in Figure 32 the general connection schematic for the solution of the butterfly operation. The two inputs, being complex quantities, are called vectors. One is termed the "upper vector" and the other the "lower vector". Each of these is fetched from a memory address location. One purpose of the SPAC 1 is to generate these addresses, called the upper vector address and the lower vector address. It is assumed the kernel is done in place, meaning that the input vectors are transformed to upper and lower output vectors which are loaded back to the same memory addresses. The other function of SPAC 1 is to generate an address for a sine/cosine ROM which supplies the rotation coefficients sin(θ) and cos(θ).


Upper vector address of vector, X1 + jY1
Lower vector address of vector, X2 + jY2
Rotation coefficient address of ROM for sin(θ) and cos(θ)

The purpose of the Address Control Logic Chip is to generate all the addresses automatically for any size FFT desired. The size of the FFT is simply loaded into the circuit during the preset cycle (Table 13), and then it outputs a different set of addresses after each clock pulse following the algorithm in Table 14. The addresses generated are the upper vector address to the butterfly, the lower vector address, and the rotation coefficient address. The SPAC 1 is implemented in a five-bit slice and requires two chips for a 1024 point FFT. The circuit is designed for a 40-pin flatpack. It requires no control except for a preset for initialization. The circuit normally runs at 1/4 the clock rate of the SPAU 1 butterfly because a new set of addresses is generated by each clock, which is then used during four cycles of the SPAU 1. Figure 33 illustrates the FFT address designations at different times. Each spread is separated into an upper and lower address block from which a "butterfly" is loaded as inputs and to which the results are returned. After the completion of each butterfly operation, the addresses are incremented until the end of the spread or level. The actual sequence of addresses generated by the LSI is shown in Figure 33. A flow chart (Figure 34) explaining the operation of the hardware follows. The block diagram of the SPAC 1 logic is shown in Figure 35. The block diagram in Figure 36 is the same except that the logic is arranged into five-bit cells.

The SPAC 1 design was completed through the simulation and layout stage, but not masked or fabricated. This was a custom design with 3500 devices and a chip size of 226 x 218 mils. Due to the extensive feedback in the counter logic, the chip was less efficient than the other custom designs. Experience with logic of the sort found in SPAC 1 teaches that considerable rework is often necessary to completely debug the chip. For these reasons, a simpler chip using a computer automated approach was selected for meeting these requirements. This was the reason for completing the hardware delivery using the SPAC 2 design taken up in the next section.
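
The kind of address stream described above can be sketched in software. This is not the SPAC algorithm of Table 14, which is not reproduced here; it is a textbook in-place radix-2 ordering chosen for illustration, with the rotation coefficient address expressed as a multiple of 2π/N. The generator name is hypothetical.

```python
def butterfly_addresses(n_points):
    """Yield (upper, lower, coefficient) addresses for an in-place radix-2 FFT."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    spread = n_points // 2                # distance between upper and lower vectors
    while spread >= 1:                    # one pass per level
        stride = n_points // (2 * spread)  # coefficient address step at this level
        for block in range(0, n_points, 2 * spread):
            for offset in range(spread):
                upper = block + offset
                lower = upper + spread
                yield upper, lower, offset * stride
        spread //= 2
```

For a 1024 point transform every address fits in ten bits, consistent with two five-bit slices, and the stream contains 512 butterflies on each of ten levels.
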

5.1 SPAC 1 Specification and LSI Details

These are shown in Appendix F.

5.2 SPAC 2 Design

Hardware deliveries were completed using a simplified alternative to the custom design SPAC 1 chip. This design is called SPAC 2 and it is made from a universal array LSI, TRW's Configurable Gate Array (CGA). This design has the following features:

a.  Brings another LSI technique, the gate array approach, to the LSI implementation program.

b.  Offers a flexible approach for application changes where different algorithms for address generation in FFT's may be desired, such as frequency to time transformations.

A block diagram of a complete FFT address generator using the SPAC 2 is shown in Figure 37. Each SPAC 2 incorporates a five-bit address slice, sufficient for a 32-point FFT. Using two chips provides a 10-bit slice for a 1024 point FFT.

In addition to the SPAC 2 chips, approximately 16 additional MSI chips for adders and counters are required. Using an all MSI approach, approximately 50 chips are required, so the saving in package count is quite good.

The SPAC 2 is a controller to other logic as shown in Figure 37. The design of SPAC 2 is based on shift register logic of a type similar to that used in the SPAC 1. The logic diagram for SPAC 2 is shown in Figure 39. The similarity can be seen by comparing this to a row logic section of Figure 5.5.

Figure 38 shows a block diagram for the SPAC 2. A one-bit register element for the coefficient is shown in Figure 40, and the corresponding vector address element is shown in Figure 41. There are approximately 26 gates used per bit slice, for a five-bit slice of 129 gates. This fits well into the 158 gate complement and 40-lead package normally used for the CGA approach. The gate placement and wiring diagram for the SPAC 2 CGA chip is shown in Figure 42. This drawing is output from the CGA approach and is used to check the interconnections. There is one-to-one correspondence between this and the logic diagram shown in Figure 39. Gate numbers and pad numbers are the same in both drawings.

Appendix G is a summary of the gate specifications and the design information of TRW's Configurable Gate Array.

SECTION VI
LSI/VLSI TECHNOLOGY

The technology used throughout this study, and used to design, fabricate, and deliver packaged chips under this program, is the triple diffusion (3D) technology. This was used for the custom designs as well as the universal array design. Table 15 lists the type and kind of LSI chip along with size, number of devices, and device packing density for all chips delivered under this program.

TABLE 15. SALIENT FEATURES OF LSI CHIPS

TYPE                          DESIGNATION   CHIP SIZE       NO. OF     AREA/DEVICE
                                            (mils x mils)   DEVICES    (mils2/device)

Custom Random Logic           SPAC 1        226 x 218        3,500     14.1
Custom Uniform Logic          SPAU 1        351 x 315       15,000      7.4
Custom Uniform Matrix Logic   SPAU 2*       256 x 256       13,000      5.1
Custom Uniform Matrix Logic   MPY 16*       278 x 278       18,000      4.3
Custom Uniform Linear Array   SPDL          115 x 134        2,280      6.8
Configurable Gate Array       SPAC 2        208 x 208        1,160     37.3

*Uses reduced geometry

Reviewing Table 15, the most densely packed designs are the uniform matrix or linear type arrays. Most of the chips use 4 micron space and feature dimensions, but SPAU 2 and MPY 16 use 3 micron dimensions, and this accounts for the better packing density. The smaller chips suffer from this type of calculation because the total chip area from scribe line to scribe line is taken. For smaller chips, proportionally more area is used in the kerf and for pads.

The SPAU 1, although listed as uniform, has a good deal of control logic. If this were made with the reduced geometry of SPAU 2, the device density would be approximately the same: 5.5 mils2/device versus 5.1 mils2/device for the SPAU 2.
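
The area per device column of Table 15 follows directly from the chip size and device count. The short check below recomputes it; the dictionary layout and function name are illustrative only, and the figures are those listed in the table.

```python
# designation: (width in mils, height in mils, number of devices, listed mils2/device)
TABLE_15 = {
    "SPAC 1": (226, 218, 3500, 14.1),
    "SPAU 1": (351, 315, 15000, 7.4),
    "SPAU 2": (256, 256, 13000, 5.1),
    "MPY 16": (278, 278, 18000, 4.3),
    "SPDL": (115, 134, 2280, 6.8),
    "SPAC 2": (208, 208, 1160, 37.3),
}


def area_per_device(width_mils, height_mils, devices):
    """Total scribe-to-scribe chip area divided by the device count."""
    return width_mils * height_mils / devices


def recompute_table():
    """Return {designation: (computed, listed)} area per device in mils2."""
    return {
        name: (area_per_device(w, h, n), listed)
        for name, (w, h, n, listed) in TABLE_15.items()
    }
```

Each recomputed value agrees with the listed figure to within 0.1 mils2/device.
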

6.1 3D-LSI Description


The triple diffusion (3D) bipolar integrated circuit technology was introduced in the early 1960's to meet the impending need for device integration on a monolithic silicon chip. It offered the most direct and simplest means of producing electrically isolated transistors and resistors on a chip. This remains true today. With this technology, NPN transistors and resistors are self-isolated by PN junctions. PNP's are vertical devices with collectors common with the substrate. The resistors are N type, either collector or collector-pinched. With the advent of LSI, the 3D technology was revived to implement medium speed (typically 5 to 30 MHz), high complexity chips with 2,000 to 20,000 devices. A large number of LSI designs have been produced with sizes ranging from 100 x 100 mils upward.

The reason for using 3D instead of the more prevalent epitaxial method lies in the higher producibility and device density obtainable. Although present 3D practice produces devices with lower alpha cutoff frequencies than epitaxial devices, this has not resulted in a particular handicap for medium speed applications.

The circuit performance within an LSI depends upon optimization of the circuit, given the device characteristics derived from the physical structure. The other way around would be to tailor the process to meet the circuit. Considering the wide versatility of circuit engineering compared to what can be done in a practical manner with physical processes, it seems wiser to tailor the circuit. This has been the course followed, which has led to not-so-conventional circuits that better use the 3D device properties. Several types of these are discussed.

LSI systems employ MSI and SSI in addition to the basic LSI devices. Custom 3D-LSI chips seldom stand alone, but must electrically interface with LSI/MSI/SSI chips in order to achieve cost effectiveness. As a result, two different criteria are adopted, one for the internal logic and the other for the input/output interface.

6.2  Triple  Diffusion  Process 

Since there may be a confusion of names concerned with technologies of similar types, we define the triple diffusion process to mean a process for bipolar transistors of both types (NPN and PNP) and self-isolated resistors. This process consists solely of an impurity deposition and distribution taken three times in sequence (or by fewer steps using advanced processing means). Figure 43 shows the plan and cross-sectional view of present triple diffused devices.

Figure 43. Triple Diffusion Structure

The silicon substrate is prepared by oxidation, followed by conventional photoresist mask and etching steps to delineate the "collector" areas. Impurity doping is done by phosphorus ion implantation, followed by a thermal distribution diffusion. The base and emitter regions follow in sequence and are prepared in a similar way. Intraconnections are made by etching electrode contacts through the protecting oxide and depositing Ti-Al metal. Finally, the metal is etched, and the surface is covered with a passivating surface oxide.

Figure 43 shows a PNP transistor coalesced or merged into the NPN transistor. Also, the resistor terminates on the collector of the assembly and is, in fact, a simple extension of the collector region. This property of coalescing structures having a common potential in a circuit reduces the intraconnection complexity and does much to enhance the device density obtainable with 3D.

Figure 44 is a more detailed transistor cross-section showing the principal steps in the process. Also, the N+ top collector contact ring is shown. This is necessary in order to keep the collector spreading resistance low. Guard-ring construction is used. This is shown as an outer P+ diffusion around the device; actually, the entire field is P+, except for cutouts where transistors or other devices are fabricated.

Figure 44. Triple Diffusion Process


6.3  30  DEVICE  PROPERTIES 


The  30  process  uses  P type  silicon  substrate  as  an  active  device  region  as 
opposed  to  the  weak  Interaction  mode  normally  processed  Into  IC's.  Consequently, 
the  transistor  parameters  must  Include  this  additional  effect.  PNP's  are  formed 
from  the  P+  "base,"  N "collector,"  and  P substrate  as  shown  In  Figure  45.  These 
are  vertical  operating  devices  with  the  collector  common  to  the  substrate  and  can 
be  used  only  In  a common  collector  connection.  The  N type  transistors  are  four 
region  NPNP's.  This  actually  Is  a small  departure  from  conventional  practice,  and 
the  only  additional  factor  to  consider  Is  the  Inclusion  of  an  additional  branch  In 
the  transistor  equivalent  circuit  for  base  overdrive  current  removal  by  PNP  action 
to  the  substrate. 

Certain  types  of  connections,  like  the  saturated  coupling  transistor  in  TTL, 
operate  differently  In  the  NPNP  mode.  This  Is  not  used  In  3D,  since  practically  all 
base  overdrive  current  (a  factor  of  ap)  would  be  removed  by  the  substrate.  On  the 
other  hand,  operating  a grounded  emitter  NPNP  with  base  overdrive  current  causes  the 
transistor  to  "saturate"  In  a very  limited  way.  The  overdrive  base  current  Is 
removed  by  active  PNP  action  to  the  substrate.  Consequently,  the  storage  time  Is 
small  (ts  » 4 nsec);  the  charge  represented  by  the  Induced  base  charge  of  the  under- 
lying PNP  transistor.  However,  there  are  simple  modifications  to  the  TTL  input 
coupling  transistor  (which  are  shown  later)  that  allow  compatibility  with  the  30 
process. 

The  lateral  PNP  transistor  used  In  weak  Interaction  technologies  Is  not  available 
In  30.  This  Is  because  the  strong  vertical  diffusion  gradients  restrict  transistor 
action  to  the  Immediate  vicinity  of  the  device.  This  works  to  the  advantage  of  30  by 
allowing  unrestricted  coalescing  of  NPNP's  with  PNP's. 

This reduced lateral transistor action observed in 3D technology has a benefit
with respect to the latch-up phenomenon. Latch-up involves PNPN action (not NPNP) for
negatively biased substrates and positively biased circuit elements. Because of lateral
transistor action in the weak substrate interaction technologies, certain spacing
rules and guard ring techniques must be used to avoid this problem. The only precaution
for 3D is to ensure a well-grounded substrate (requires a back wafer metallization).

Typical 3D device and structure parameters are shown in Table 16. For circuit
simulation, these are used with a modified Ebers-Moll model. Figure 45 is a compilation
of NPNP and PNP models. These models are found to be adequate (not more than 10%
error) for frequencies up to approximately 30% of the fα rating of the devices. For
high current operation, single "standard" 3D transistors are paralleled rather than
using interdigitization of base and emitter strips. This departure from epitaxial
practice is necessary because the collector spreading resistance determined by a top
N+ collector ring diffusion does not scale with interdigitization. Single standard
transistors (NPNP) are operated up to 2 mA each for the saturating inverter type or
4 mA for the nonsaturating emitter-follower or similar type where the collector
spreading resistance is not limiting. The PNP transistors are limited to 1.5 mA
collector current maximum and are paralleled for higher currents. Also, overdrive
base current to saturating NPNP's is limited to 1.5 mA per standard. The reasons for
the latter two limitations are the spreading resistance of the substrate and the
requirement that the IR drop not exceed the contact potential of emitter-base junctions
for proper operation. The silicon substrate material is P type, 0.8 ±0.2 ohm-cm.

TABLE 16. 3D DEVICE STRUCTURE AND RELATED PARAMETERS

Parameter                            Units            Value

Substrate (P type)
  Resistivity                        ohm-cm           0.8
  Concentration                      per cm3          2.5 x 10^16

Collector (N type)
  Surface concentration              per cm3          9.6 x 10^17
  Sheet resistance                   ohms/square      100
  Junction depth                     µm               4.8
  Average mobility                   cm2/volt-sec     380
  Average mobility under base        cm2/volt-sec     441

Base (P type)
  Surface concentration              per cm3          1.2 x 10^19
  Sheet resistance                   ohms/square      124
  Junction depth                     µm               2.0
  Average mobility                   cm2/volt-sec     54
  Average mobility under emitter     cm2/volt-sec     39.6

Emitter
  Surface concentration              per cm3          6.5 x 10^20
  Sheet resistance                   ohms/square      7
  Junction depth                     µm               1.5
  Average mobility                   cm2/volt-sec     47

Alpha cut-off frequency
  NPN                                MHz              143
  PNP                                MHz              51

Junction capacitances
  Emitter-base at -3V                pF/mil2          0.42
  Base-collector at -3V              pF/mil2          0.25
  Collector-substrate at -3V         pF/mil2          0.08

Diffused resistor values
  Emitter                            ohms/square      7
  Base                               ohms/square      124
  Collector                          ohms/square      100
  Pinched collector region           ohms/square      472
  Pinched base region                ohms/square      18K

The diffused collector resistors are either pinched or unpinched. The unpinched
type is normally used for low resistor values and where the circuits are less tolerant
to manufacturing variation. The pinched resistors are used in the large majority of
cases where the space required is to be minimized. Base and base pinched resistors
have been used only in special cases. Undercrossings or tunnels are provided by the
N+ diffusion (emitter) directly into the substrate. All transistors, resistors, and
undercrossings are guarded by a P+ field diffusion (this amounts to guard ring
construction practice) as shown in Figure 45. A single level metallization is used in
combination with tunnels and N+ diffusion to form all intraconnections.

6.4 3D CIRCUITS AND PERFORMANCE

The 3D technology has been used primarily for digital applications. It is
regarded as less attractive for analog functions because of the limited voltage
breakdown and gain-bandwidth of the transistors. However, it has been used in the
analog mode to provide reference voltage control, D/A, and in other utilitarian ways
to support an otherwise all-digital LSI function.

The digital logic used has been primarily emitter-follower logic (EFL). Typical
circuit elements for 3D-EFL are shown in Figure 46. The AND gates are wired-AND
made from PNP emitter-followers. The OR gates are wired-OR made from NPN
emitter-followers. This type of utilization is reminiscent of diode logic used much earlier.
Grounded emitter inverters are used to complete the necessary logic capability and to
restore logic "1" and "0" levels.

Restoration of logic level can also be accomplished in the R/S flip-flop shown
in the figure. This type of logic is used extensively within the chip and operates
with a +3 volt power supply and ground. Typical operational performance for
pseudo-noise generators having extensive combinatorial feedback gating using this logic is
shown in Figure 47. The noise margin maintained for this low level logic
configuration is relatively good, >0.3 volt for worst case conditions, including
temperature and end-of-life wearout.

Figure 47. Operational Performance of Typical 3D-EFL LSI
Temperature and Power Supply Voltage


The transfer characteristics for the R/S flip-flop circuit in Figure 46 are
shown in Figure 48 for the temperature range of -40°C to 125°C and under EDL
(0 >6.0,0 >3.0).

Figure 48. EFL R/S Flip-Flop Transfer Characteristics

Input and output circuits compatible with TTL are shown in Figure 49. These
use a 5 volt power supply and maintain a higher noise margin for the chip-to-system
interface. Totem pole outputs are typical for TTL interface, with tristate control
also available. Input/output interfaces are shown for EFL, TTL, and CML. Large
fanout 50 ohm drivers have also been used for on-chip clock buffers and off-chip
loads. For those cases, small standard transistors are paralleled in sufficient
numbers to divide and distribute the required current.

A particular EFL logic configuration using long cascades of AND-OR gating has
worked out especially well in practice. This has been used in a 16x16-bit parallel
multiplier using the successive add algorithm. The logic level is not restored
until reaching the product output.

Figure 49. Input/Output Circuits (TTL to EFL; EFL to TTL)

Throughout the long logic chain, both true and complement signals are generated
and propagated. These pairs are finally detected using a differential detector and
level restorer at the output. The main advantage of this type of logic chain is
found in the reduction of signal propagation time, since the gate delays obey a
summation principle which is less than linear summation and approximately
root-sum-square (RSS) summation. As an example, for a low-power full-adder EFL circuit
exhibiting a single stage delay time of 14 nsec, a cascade of 32 such stages exhibits
less than 100 nsec propagation delay. Figure 50 shows the attenuation and propagation
delay as a function of the number of cascaded stages.
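
The difference between linear and RSS accumulation of stage delays can be checked numerically. The following is a minimal sketch, assuming identical stages and an idealized RSS rule (total delay equals the single-stage delay times the square root of the stage count). The function names are illustrative and are not taken from the report.

```python
import math


def linear_delay(stage_delay_ns: float, stages: int) -> float:
    """Total delay if every stage delay simply adds."""
    return stage_delay_ns * stages


def rss_delay(stage_delay_ns: float, stages: int) -> float:
    """Total delay under an idealized root-sum-square rule (assumed, identical stages)."""
    return math.sqrt(stages * stage_delay_ns ** 2)


if __name__ == "__main__":
    # 14 nsec per stage is the full-adder figure quoted in the text.
    for n in (1, 8, 32):
        print(n, linear_delay(14.0, n), round(rss_delay(14.0, n), 1))
```

For the 14 nsec, 32-stage example, linear summation gives 448 nsec, while the idealized RSS rule gives roughly 79 nsec, which is consistent with the stated figure of less than 100 nsec.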

Recent extensions of 3D-LSI in the circuit area include current mode logic (CML).
This form of logic extends the frequency range to approximately 30% of the alpha
cut-off frequency of the devices and exhibits excellent delay-power products. This is a
differential form of logic; however, it can also be operated single ended by using a
dc reference voltage on one side.

CML can be made compatible with EFL, and there are logic advantages in doing this.
CML performs register logic exceedingly well, but has a more limited capability for
combinatorial logic. On the other hand, EFL properties are better for gating, and
this is where CML-EFL has an important role. An example of EFL logic to CML D type
flip-flop with output restored to EFL levels is shown in Figure 51. This circuit
dissipates 32 mW. The setup time is less than 10 nsec, and the propagation delay is
25 nsec for an 80 pF load.

Figure 50. Propagation Delay for EFL AND/OR Cascade

For a long while, it was not realized that a true TTL coupling transistor could
be used with 3D. The true TTL coupling transistor operates in a continuously saturated
mode in the sense that the base-collector junction is always forward biased, independent
of whether the logic state of the gate is true or false. Under these conditions for
3D, the overdrive base current is diverted to the substrate by the high αp (αp > 0.9)
of the 3D substrate transistor action. Figure 52 shows the problem. So it is
obvious that 3D cannot be used for TTL. Or is it?

As usual, solutions to long standing problems tend to be simple and of the "why
didn't I think of that earlier" category. Figure 53 shows the obvious solution.

The addition of resistor R2 solves the problem. Reference to the NPNP transistor
model shown in Figure 45 reveals that none of the current available at the collector
of T1 can be diverted by PNP action to the substrate, and a current like I2 is therefore
fully available for external drive requirements. Resistor values R1 and R2 are
selected to maintain a forced beta of 4, so for this case only 25% of the current is
diverted to the substrate. I1, of course, performs valuable service by maintaining
a continuously saturated condition for T1.

Figure 53. Same Circuit with Addition of Resistor R2.
I2 Cannot Flow to Substrate and Full Value
of I2 Current is Available for Drive

Other methods which obtain a reduced saturated condition for T1, such as shunting
a resistor around the base to collector of T1, could also be used; however, these
result in an increased collector-to-emitter voltage at T1 with a corresponding loss in
noise margin for the gate. Placing a Schottky clamp on the base to collector of T1
would avoid the problem, but it also causes a loss in noise margin and a loss in
producibility through the requirements for more complicated processing.

Taking advantage of this simple scheme, two low power TTL gates were designed
into 3D to implement a universal array type of LSI, called CGA (for configurable
gate array).


Figure 54. One Milliwatt 3D TTL Gate, tpLH = 30 nsec,
tpHL = [illegible] nsec for RL = 4K, CL = 30 pF

6.5 DEFECT FREE YIELD OF 3D-LSI


The singular valid reason for a technology laying claim to being an LSI technology
is that it exhibit economic producibility or high yield for complex circuits.
A reasonable way to measure the yield and compare this with other competing processes
is to construct a graph of chip size (area) versus yield. Using statistical distribution
arguments, Dingwall among others has placed a parameter D0 on such a plot
which is descriptive of the number of defects per unit area (cm2). A series of curves
drawn for 3D in Figure 55 shows a progressive trend toward lower D0, i.e.,
increased producibility. As shown, this is approximately:

    D0 (Defects/cm2)    Year

    10                  1972
    3                   1974
    1.5                 1976

The yield is plotted for chips passing a functional test at wafer probe, since
this is most descriptive of the results of the wafer fabrication process. Subsequent
dicing, packaging, visual inspection, and final testing lose approximately 20%
beyond the wafer probe test. This is largely independent of wafer processing.
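
To give a feel for what a defect density figure implies for chip yield, the sketch below evaluates a simple Poisson yield model, Y = exp(-D0 x A). This model is an assumption chosen for illustration; the report cites statistical distribution arguments but does not state the formula behind Figure 55. The chip area used in the example is hypothetical, and the function names are illustrative. The approximately 20% post-probe loss is applied as a multiplicative factor.

```python
import math

# Defect densities (defects per cm2) by year, as listed in the text.
D0_BY_YEAR = {1972: 10.0, 1974: 3.0, 1976: 1.5}


def poisson_yield(d0_per_cm2: float, area_cm2: float) -> float:
    """Wafer-probe yield under an assumed Poisson defect model: Y = exp(-D0 * A)."""
    return math.exp(-d0_per_cm2 * area_cm2)


def final_yield(probe_yield: float, post_probe_loss: float = 0.20) -> float:
    """Yield remaining after dicing, packaging, inspection, and final test."""
    return probe_yield * (1.0 - post_probe_loss)


if __name__ == "__main__":
    area = 0.25  # hypothetical chip area in cm2 (5 mm x 5 mm), not from the report
    for year, d0 in sorted(D0_BY_YEAR.items()):
        y = poisson_yield(d0, area)
        print(f"{year}: probe yield {y:.2f}, final yield {final_yield(y):.2f}")
```

Other yield models weight clustered defects differently, so the absolute numbers from this sketch should be read only as a trend indicator.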


Figure 55. 3D-LSI Yield

SECTION VII

DIGITAL FILTER DEMONSTRATION UNIT, DFDU

The Digital Filter Demonstration Unit, DFDU, is a generalized FFT
system that can be used to demonstrate the characteristics of FFT's as
well as systems utilizing FFT's. The unit includes an analog preamplifier,
a filter, A/D converter, and a VCO for producing modulated FM as
well as direct low frequency input signals. It also includes an FM
demodulator and audio detector and amplifier. A block diagram display
of the DFDU functions is shown in Figure 56. There are basically
three overlapped processes implemented in the DFDU: the dual input
memory collects 8-bit sampled data at a switch selectable rate and
block size.

The other input memory is used to input data to the FFT processor in
the first level of operation. The FFT processor performs one butterfly
operation at a maximum rate of 480 ns per butterfly in a pipelined fashion
and stores intermediate results into its own memory, which is 1024 x 24.
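
As a rough check on throughput, the butterfly rate bounds the transform time. The sketch below assumes a radix-2 FFT, which needs (N/2) log2 N butterflies; the report does not state the radix at this point, so that is an assumption, as are the function names.

```python
BUTTERFLY_NS = 480  # maximum butterfly rate from the text, in ns per butterfly


def butterfly_count(n_points: int) -> int:
    """Number of butterflies in a radix-2 FFT (assumed): (N/2) * log2(N)."""
    levels = n_points.bit_length() - 1
    if n_points < 2 or (1 << levels) != n_points:
        raise ValueError("N must be a power of two")
    return (n_points // 2) * levels


def min_transform_time_us(n_points: int) -> float:
    """Lower bound on transform time in microseconds at the maximum butterfly rate."""
    return butterfly_count(n_points) * BUTTERFLY_NS / 1000.0


if __name__ == "__main__":
    for n in (16, 32, 64, 128, 256, 512, 1024):
        print(n, butterfly_count(n), min_transform_time_us(n))
```

Under these assumptions a 1024-point transform needs 5120 butterflies, or about 2.46 ms at the maximum rate.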

The process and sample rates are all derived from the programmable system
clock. A block diagram of the complete system is shown in Figure 57.

The processor can be selected to perform an N point FFT (where N = 16,
32, 64, 128, 256, 512, or 1024) at a sample rate of

    S = [12(t+1) log2 N x 10^-8]^-1

where t is panel selectable from 1 to 999. The window period becomes

    WP = [12(t+1) N log2 N] x 10^-8 sec.

The  filter  spacing  is 

FS  (filter  spacing)  ■ ^pya- ■ 

a = Nyquist rate, generally > 2.

A block diagram of the details of the FFT Kernel Processor is shown in
Figure 58.

Figure 57. Digital Filter Demonstration Unit

Figure 58. DFDU FFT Kernel Processor


The output of the FFT processor is loaded into the output memory during
the last level of the FFT process. The output window processor accesses
complex data from its memory and performs the windowing; it then converts
the vector to its magnitude and accumulates the outputs over the panel
selectable integration interval in the accumulator memory. Figure 59 is
a block diagram of the window processors.

The output frequency discriminator processor scans the N frequency components
from the accumulator memory and holds the addresses (frequencies) of
the four largest components.

These values are accessed by queue of the output FM demodulator hardware.
The code is used to drive a D/A that drives a VCO for demodulating
the input FM signal. These feedback paths may be deleted or bypassed when
used in other modes. They serve to self test the DFDU by running it in a
closed loop configuration. FFT and window processors and memory buses can
be monitored either digitally by sampling at the selected time slot or in
continuous operation by monitoring any selected bus through three D/A
display channels. There are two digital channels; each channel can select
any of several input sources and is useful for maintenance and for viewing the
process in various levels of its operation, as well as the final output
spectrum, by the use of an external lab oscilloscope. Figure 60 shows the
block diagram for the Maintenance and Display console.

Pictures of the DFDU are shown in Figures 61 through 63. The first
shows the stand alone electronics in a cabinet. The second shows the display
panel and the third shows the front panel opened to reveal the PC wire wrap
construction.

Figure 59. DFDU Window Processor

SECTION VIII
OVERALL CONCLUSIONS

Definitive conclusions were reached in several areas regarding the implementation
of LSI for use in digital filter processing applications. These are:

1)  The LSI replacement of MSI in high arithmetic function density
areas such as multiplication-accumulation can show a four to
one reduction in power and cost, and one 64-lead LSI package
can replace more than 50 16-lead MSI packages.

2)  Taken at a major system function level such as an FFT processor,
the use of LSI (custom) in combination with MSI (off the shelf)
versus an all MSI (off the shelf) design results in a recurring cost
reduction of 23%. If three custom LSI designs are required for
a typical new system of this type, the breakeven number of systems
needed to amortize the development cost is approximately 100. (See
section 2.4.)

3)  For an LSI technology to qualify as a practical cost effective
implementation means, the manufacturing defect density must be
less than about four defects per cm2 at a chip level gate density
of greater than 100 gates per mm2. In 1977 the 3D technology
exhibits less than 2 defects/cm2 at greater than 200 gates per
mm2. The 3D technology was the means used for LSI fabrication
on this program.

4)  Given a physical technology as described above in 3), the most
challenging aspect of the LSI is deriving a suitable partitioning
for effective system utilization and thereby specifying the LSI
design. A major obstacle to effective LSI utilization is the
limitation in LSI pins available. Sixty to ninety pins is considered
the upper limit for reliable inexpensive packaging.
The results of this program show that a suitable engineering
compromise can be reached.

5)  Practical packaging and handling of power dissipation in LSI
over the military temperature range can be realized in 64
lead packages, either flat-pak or DIP, and up to 5 watts.

6)  A general purpose LSI design called the SPAU 2 was shown to
be highly effective for all sorts of generic digital filter
problems. This design is based on a multiplier-accumulator
type circuit organization. An earlier design called the SPAU 1
was a stepping stone to this final design. Two other hardware
designs, the SPDL and SPAC LSI, had less general and more specific
utilization in FFT applications.

It is apparent from the results of this work that implementation of
avionic systems can greatly benefit from the LSI/VLSI technology. It was
shown by means of the delivered hardware on this program that practical technology
exists at the required complexity level to meet these needs.

APPENDIX A

SPECIFICATIONS AND APPLICATION NOTES FOR SPAU 1


Note: This was the first Signal Processing Arithmetic Unit, SPAU,
designed. This is labeled "SPAU" throughout this Appendix A.

ASSUMPTIONS AND INTRODUCTION

The LSI 3D EFL Signal Processing Arithmetic Unit chip (SPAU) contains
control networks, control storage, a 12x12 multiplier, two twelve bit
adder-subtractors, a divide by two network, interconnecting paths and tristate
outputs. This specification of the arithmetic unit contains a detailed
description of the functions and the electrical characteristics and delays of the
signal processing arithmetic unit.

1.0  FUNCTIONAL 

1.1 Operand Holding Registers

There are seven data storage registers contained within the SPAU chip.
They receive, load under gated clock, and hold data from pin inputs, internal
paths or pin inputs, or arithmetic functions. The registers are placed to
provide pipelined processing and may be simultaneously clocked, advancing
elements (the function between two clocked registers) along a data pipeline.
Table 1 lists the registers and their use. These registers are
shown in the functional block diagram, Figure 1.

1.2 Configuration Control Register

One register containing five bits holds the configuration control of
each SPAU LSI. The configuration register holds the static or low rate data
path controls as compared to input pins which drive high rate data path
controls. This register is designed in edge-triggered D flip-flops to
facilitate overlapped control register, simultaneous control (ROM) address
register, and AU configuration register clocking. The function of each low
rate control stored in the configuration register is given in Table 2. This
register is clocked by the C1 control clock. The C1 clock is distributed to
all SPAU's in a given configuration. The inputs to the configuration register
are time shared with s(SN-11) data, the data lines of lowest use per pipeline
period. A pipeline period is the longest time between clocked elements
within the pipeline.

There are three control lines which change with sufficient frequency
to warrant their independence of timeshared interleaving with data on the
s(SN-11) lines. The function of these lines is given in Table 3.

1.4  Data  Register  Pipeline  Clock 


The  SPAU  chip  uses  a gated  continuous  clock  to  strobe-latch-capture  data 
in  the  pipeline  stream.  The  C2  clock  strobes  the  clock  decoder  and  produces 
from  one  to  four  relatively  simultaneous  clocks.  One  pin  is  allocated  for 
the  C2  clock  and  four  pins  for  the  clock  code.  The  sixteen  clock  codes  and 
the  register  clocks  produced  are  given  in  Table  4.  The  C2  clock  is 
distributed  to  all  SPAU's  in  a given  system  configuration. 

1.5  Inputs 

There are three tristate bus inputs to each SPAU LSI. These are t(SN-11)T,
d(SN-11)T, and s(SN-11)T. Each is twelve bits in width representing fractional
fixed point two's complement data in the true sense (logic "1"). The
distribution of each input is given in Table 5 after the input passes through
a synchronous reset override.

1.6  Outputs 

There is one TTL compatible tristate twelve bit two's complement
fractional output per AU LSI chip. The output source is the internal RX
bus passing through a divide by two or normal multiplexer. The data outputs
are generated in a logic "1" signal high true sense.


1.7  Multiplier  and  Adder  1 

The multiplier performs two's complement fractional multiplication of
the values stored in the multiplier register T(SN-11) and the multiplicand
register D(SN-11), forming a twenty-three bit product. The round configuration
control bit provides rounding or truncation of the product to the twelve
most significant bits including sign.

Adder 1 is left justified at (SN-11) in the product range (SN-22).
Adder 1 is implemented as one level of adder extension beyond the
adders forming the partial products. The product and sum are formed
simultaneously, their networks being combined. The augend source is
kA(SN-11), the one's complement or true value in A plus the carry injection
into adder bit 11 equal to a logic "one" or "zero" respectively for subtract
or add. The adder 1 add time is equal to the multiply time, approximately
140 nanoseconds from the output of T, D, or A (whichever is loaded
last) to the input of R, discounting the set-up time for R.
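
The fractional multiply, the round/truncate option, and the one's complement subtract can be illustrated in software. The sketch below is a behavioral model only, under stated assumptions: operands are 12-bit two's complement fractions (sign plus 11 fraction bits), the product carries 22 fraction bits, and rounding adds half of the retained LSB before the low 11 bits are discarded. The exact rounding rule and overflow handling of the SPAU are not given in this section, so those details and all function names are hypothetical.

```python
FRAC_BITS = 11  # 12-bit word: sign bit plus 11 fraction bits


def to_fixed(x: float) -> int:
    """Quantize a fraction to a 12-bit two's complement integer, clamped to range."""
    raw = round(x * (1 << FRAC_BITS))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, raw))


def to_float(v: int) -> float:
    """Interpret a 12-bit two's complement integer as a fraction."""
    return v / (1 << FRAC_BITS)


def multiply(t: int, d: int, round_result: bool) -> int:
    """Fractional multiply reduced to the 12 most significant bits.

    Note: (-1) x (-1) overflows the 12-bit result; that case is not handled here.
    """
    product = t * d  # 22 fraction bits
    if round_result:
        product += 1 << (FRAC_BITS - 1)  # add half of the retained LSB (assumed rule)
    return product >> FRAC_BITS  # Python's >> is an arithmetic shift


def add_or_subtract(product: int, a: int, subtract: bool) -> int:
    """Add A, or subtract it as one's complement of A plus a carry-in of one."""
    operand = ~a if subtract else a
    carry_in = 1 if subtract else 0
    return product + operand + carry_in


if __name__ == "__main__":
    half = to_fixed(0.5)
    print(to_float(multiply(half, half, round_result=False)))
```

The subtract path relies on the identity that the one's complement of A plus one equals the two's complement negation of A, which is why a single carry injection suffices to switch between add and subtract.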

1.8 Adder 2

The second adder in the SPAU is disjoint from the multiplier and
adder 1. Its functions operate in full overlap with the multiply and add
in the other section and at approximately twice its rate (94 nanoseconds
from XS(SN-11)Q or tP(SN-11)T).

Adder 2 performs a fast sum and difference for two loaded operands and
is used as a self-contained high speed accumulator.

Adder 2 limits for maximum positive and maximum negative overflow. The
value (2^0 - 2^-11) is forced into the X register for positive overflow. The
value (2^0 + 2^-11) is forced into the X register for negative overflow.
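
The limiting behavior of Adder 2 can be modeled as a saturating add. In the sketch below the positive limit is taken as 1 - 2^-11 (bit pattern 0111 1111 1111, or +2047 in LSB units). The negative limit is read as the bit pattern with the sign bit and the LSB set (1000 0000 0001, or -2047), i.e. a symmetric limit. That reading of the negative limit, and the helper names, are assumptions rather than statements from the specification.

```python
POS_LIMIT = 2047   # 0111 1111 1111, i.e. 1 - 2**-11 as a fraction
NEG_LIMIT = -2047  # 1000 0000 0001, i.e. -(1 - 2**-11); assumed reading
WORD_MIN = -2048   # most negative value representable in 12 bits
WORD_MAX = 2047    # most positive value representable in 12 bits


def saturating_add(x: int, operand: int, subtract: bool = False) -> int:
    """Add or subtract two 12-bit values, forcing the limit value on overflow."""
    total = x - operand if subtract else x + operand
    if total > WORD_MAX:
        return POS_LIMIT
    if total < WORD_MIN:
        return NEG_LIMIT
    return total


def accumulate(samples):
    """Use the saturating adder as an accumulator starting from zero."""
    x = 0
    for s in samples:
        x = saturating_add(x, s)
    return x


if __name__ == "__main__":
    print(accumulate([1500, 1500, -100]))
```

Limiting rather than wrapping keeps an overflowing accumulation at full scale with the correct sign, which is usually the less harmful failure mode in a filter.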


1.9 Reset Override

The periodic reset employed by signal processors in accumulate and
dump is implemented as a synchronous reset override at the inputs t(SN-11)T,
s(SN-11)T, and d(SN-11)T. The only registers cleared are those which
are clocked during the presence of the reset at the input pin.

Table 2

CONFIGURATION CONTROL REGISTER


2.0 ELECTRICAL

2.1  Registers 

The SPAU 3D-EFL LSI chip contains eight registers: five
single rail D edge triggered flip-flops and three full clock dual rail R/S
latches. The specifications for inputs to these EFL implemented registers
are given in Tables 6 and 7.

The clocking of these registers is performed by a strobed, decoded
clock select. The clocks to the eight registers are enabled from a four
bit select code presented to the clock decode input of the AU (Table 4).

2.2  Clock  Select  Codes 

The timing of the clock select codes and clocks is given in Figures
2 and 3 and the tolerances of the clocks are given in Table 8.

2.3 Tristate Outputs

The tristate outputs of the SPAU are specified as follows. This
states a capability of a 25 nanosecond risetime for a maximum a.c.
load of 50 pF in conjunction with a d.c. load of 6 EFL or two 54S loads.

The use rule is or will be a d.c. maximum of four low power Schottky
loads, three bused outputs for 24 picofarads of "off"-output device capacitance
(12 per) and 24 picofarads of input device capacitance (6 per), and no more
than 10 picofarads for interconnect. (30 pF at 10 inches using laminates
28 mil thick.)

The tristate outputs of the SPAU are specified in Table 9 and Figure 4.

A low input at BZZF enables the tristate device and the OnnBT outputs are the
inverse of the state of the internal bus RXnnB. This provides a faster disable
than enable for the tristate devices and ensures compatibility with other
54125 devices or AU's on the bus.

2.4  Data  Inputs 

The SPAU chip has three twelve bit input paths. The "tSNT-t11T" path
inputs to an overriding clear gate of one d.c. load. The "dSNT-d11T" path
inputs to an overriding clear gate of one d.c. load. The "sSNT-s11T" path
inputs to an overriding S clear gate of one d.c. load. The specifications
for these inputs are given in Table 10 and Figure 3.

2.5  Select  Control  Inputs 

The high rate control select inputs drive two type two inverters. The
specifications for these inputs are given in Tables 11 and 13 and Figure 5.

2.6  Configuration  Control 

The low rate control selects are held in the configuration control D
edge triggered flip-flops. This register receives complemented s inputs
and is clocked by the control clock. The specification for this register
is the same as the D flip-flop specification in Table 6. The propagation
delays to the using destinations are given in Table 12 as illustrated in
Figure 6.

2.7  High  Power  Drive  External  Requirements 

One input line is heavily loaded on each SPAU. This line is
listed in Table 13 for the proper choice of external drivers.

2.8  Supply  Power 

The predicted power requirements of the signal processing arithmetic
unit (SPAU) LSI are given in Table 13A.

TABLE 7

R/S LATCH INCLUDING SET-UP & TRANSFER GATES

PARAMETER                         MNEMONIC   VALUES (MIN/TYP/MAX)   UNIT

Supply Voltage                               4.5 / 5.0 / 5.5        V
Clock Pulse Width                            30 / 60                ns
Clear Pulse Width                            112 / 120              ns
Data Hold Time                    thold      60 / 90                ns
Q Prop Delay-Low to High-Clock    tpLH       50 / 60                ns
Q Prop Delay-High to Low-Clock    tpHL       22 / 30                ns
Data Set-up S Low                 tsetL      36                     ns
Data Set-up S High                tsetH      19                     ns

TABLE 8

CLOCK SPECIFICATION

FUNCTION               PARAMETER                 MNEMONIC   VALUES (MIN/TYP/MAX)   UNIT

Clock Code Input       Set Up, Input to Strobe   tcs        32 / 39                ns
                       Input High Level          VQIH       2.0 / 3.4              V
                       Input Threshold           VQIT                              V
                       Input Low Level           VQIL       .3 / 0.8               V
                       Rise Time                 tr         20 / 30                ns
                       Fall Time                 tf         21 / 30                ns
                       Hold Time                 thold      84 / 120               ns
                       Input Capacitance                    15                     pF

Clock Input (C2CLKF)   Rise Time                 tr         4 / 8.0                ns
                       Fall Time                 tf         4 / 8.0                ns
                       Symmetry Deviation        sym        10                     %
                       Input High Level          VCIH       2.0 / 3.4              V
                       Input Threshold           VCTH                              V
                       Input Low Level           VCIL       .3 / 0.8               V
                       Skew (Distribution)       tskewD     9                      ns
                       Strobe A Unload           tcu        30 / 37                ns
                       Skew (Internal)           tskewI     4                      ns
                       Input Capacitance                    15                     pF

                       Supply Voltage            Vcc        4.5 / 5 / 5.5          V
                       Input Current High        IIH        100 / 200              µA
                       Input Current Low         IIL        -200 / -400            µA

PARAMETER

Low Level Input Volts
High Level Input Volts
Supply Voltage Volts
Low Level Input Current
High Level Input Current
Input Capacitance
Control Propagation
Control Propagation

MNEMONIC


TABLE 12

CONFIGURATION DELAYS FROM CLOCK

FUNCTION      S STATE   PARAMETER              MNEMONIC   VALUES (MIN/TYP/MAX)   UNIT

Scale         1         Propagation Low-High   tpLH       45 / 72                ns
No-Scale      0         Propagation High-Low   tpHL       45 / 72                ns
d to P reg    1         Propagation Low-High   tpLH       45 / 72                ns
RX to P reg   0         Propagation High-Low   tpHL       45 / 72                ns
T = +1        1         Propagation Low-High   tpLH       45 / 72                ns
T = t         0         Propagation High-Low   tpHL       45 / 72                ns
Round         1         Propagation Low-High   tpLH       45 / 72                ns
Truncate      0         Propagation High-Low   tpHL       45 / 72                ns
SUBA          1         Propagation Low-High   tpLH       47                     ns
ADDA          0         Propagation High-Low   tpHL       40 / 62                ns
s to reg      1         s to config clock      tSETL*     33 / 55                ns
s to reg      0                                tSETH      35 / 55                ns

* Assumes 50% derating for worst case timing.

                             TPLH/TPHL
And/Or                       15/15
#2 Inverter W/3 loads        25/10
#1 Inverter W/1 load         10/25
Level Resistor Divider       5/5

TABLE 13


SIGNAL 

PARAMETER 

MNEMONIC 

RSTF 

High  Level  Input  Voltage 

VIH 

Low  Level  Input  Voltage 

VlL 

Input  Capacitance 

CI 

High  Level  Input  Current 

*IH 

low  Level  Input  Current 

1 IL 

Propagation  Time,  High  to 

Low  Level 

tPDHL 

Propagation  Time,  Low  to 

High  Level 

tPDLH 

bzz  F 

High  Level  Input  Voltage 

VIH 

Low  Level  Input  Voltage 

VIL 

Input  Capacitance 

CI 

High  Level  Input  Current 

!ih 

Low  Level  Input  Current 

!il 

Propagation  Time,  High  to 

Low  Level 

tPDHL 

Propagation  Time,  Low  to 

High  Level 

tPDLH 

MAX  UNIT 


L ljL/Input  ■ 20^A;  IJH  ■ 5^A 
* 4pF  pad  + (2pF  x 12  Inputs)  ♦ 24pF  metal 


2.0  3. 


.6 


12  pF 


.5  ma 


10 

ns 

20 

ns 

3 
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PREDICTED  POWER  REQUIREMENT 


PARAMETER                   MIN      TYPICAL   MAX      UNIT

Case Temperature            -40      25        +125     °C
Vcc                         4.5      5.0       5.5      V
Power Requirement           3.16     5.14      7.0      W
Icc Total                   630      1,028     1,500    mA
Icc per pin                 210      343       500      mA
Vgg Chip - Vgg Package      .060     .103      .150     V
Vcc Chip - Vcc Package      .060     .103      .150     V

R1 Icc: .1 ohm (per pin: .3 ohm)
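
The per-pin current and supply-drop rows of the predicted power table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the count of three supply pins is an assumption inferred from the ratio of the two Icc rows (630 / 210), and the function names are hypothetical.

```python
# Values from the predicted power table (mA and ohms).
ICC_TOTAL_MA = {"min": 630, "typ": 1028, "max": 1500}
SUPPLY_PINS = 3      # assumption: inferred from 630 mA total / 210 mA per pin
R_TOTAL_OHM = 0.1    # R1 Icc; .3 ohm per pin over three pins in parallel

def per_pin_ma(total_ma, pins=SUPPLY_PINS):
    """Supply current carried by each pin, rounded to the nearest mA."""
    return round(total_ma / pins)

def supply_drop_v(total_ma, r_ohm=R_TOTAL_OHM):
    """Chip-to-package supply drop in volts (I * R), rounded to 1 mV."""
    return round(total_ma / 1000 * r_ohm, 3)

for case, icc in ICC_TOTAL_MA.items():
    print(case, per_pin_ma(icc), "mA per pin,", supply_drop_v(icc), "V drop")
```

Under these assumptions the typical and maximum cases reproduce the tabulated 343 mA, 500 mA, .103 V, and .150 V; the minimum case computes to .063 V against the tabulated .060 V.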
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3.0  Functional  Timing 

The maximum propagation times, currents, and capacitances are derated
20% for over-temperature performance, -40°C to 125°C, and for any differences in
the production of wafers. The typical values are those times expected in
performance at 25°C ambient.

3.1  Operation  Timing 

The Arithmetic Unit (SPAU) executes a function by advancing data
elements, generally from a sequential array, through the AU. The input of
array-element data, the processing of the previous data element inputs, and the
output of a result-element are concurrent operations which are executed in
overlap.

3.1.1  Transfer-Element  Timing 

The loading and dumping of arithmetic unit LSI registers are performed
in blocks of time from the clock loading the source register to the clock loading
the destination register. These blocks of time are called transfer-elements. There
are internal and external transfer-elements: those associated with register
to register transfer on the same AU, and those associated with inputting into and
outputting from the AU, respectively. There are three possible concurrent
transfer-elements per AU. Inputs must be held across a delayed clock
"well", while internal and output transfers are edge triggered and require
only a 5 nanosecond "hold time". Table 14 details the worst case and typical
propagation delays through the possible transfer-elements. These times are
adjusted to reclocking at a common system clock rate, providing margin for
excess time in transfer-elements before clock arrival. See Section 3.2 for
a discussion of clocks.

3.1.2  Process-Element  Timing 

There are two process-elements nested between transfer-elements and
a scaling-element included in the output transfer-element. A process-element
spans the time from the leading edge of the clock loading an operand
register to the leading edge of the clock latching the arithmetic result
register. The process-element is equated to the result register's alpha
character mnemonic.
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The  two  process -elements  are: 

R = D * T + kA;  k = ±1    (1)

X = S + iP;  i = ±1    (2)

Both (1) and (2) are overlapping, concurrent processes. Table 15 details
the worst case and typical propagation delays through these process-elements.

The clocking of transfer-elements also steps data through process-elements.
The transfer-elements serve to move data efficiently through cascaded
process-elements, forming a pipeline. The clock codes serve to step the inputs and
outputs of process-elements with one clock, with the intent of doing so as
close to margins as is practical. In this manner, a given process-element
is executed at maximum utility.
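
The two process-elements can be modelled in a few lines. This is a minimal sketch only: it uses unbounded Python integers, ignores the 12 bit word length, rounding, and scaling, and the function names are hypothetical rather than taken from the specification.

```python
def process_r(d, t, a, k):
    """Process-element (1): R = D * T + k * A, with k = +1 or -1."""
    if k not in (+1, -1):
        raise ValueError("k must be +1 or -1")
    return d * t + k * a

def process_x(s, p, i):
    """Process-element (2): X = S + i * P, with i = +1 or -1."""
    if i not in (+1, -1):
        raise ValueError("i must be +1 or -1")
    return s + i * p

# Cascading the two elements: the R result is transferred to P
# (an internal "R to P" transfer-element) and then summed with S.
r = process_r(d=3, t=4, a=5, k=-1)   # 3 * 4 - 5
x = process_x(s=10, p=r, i=+1)       # 10 + r
print(r, x)
```

In the hardware the two elements overlap in time; the sequential calls here only show the data dependency between them.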

3.2  Clocking 

The assumed clocking scheme for the system uses a symmetric, complementary
two phase clock (one the complement of the other). This permits use of the
standard 16.66 MHz oscillator in contrast to two phase asymmetric,
noncomplementary clocks generated by a 50 MHz oscillator.

The SPAU uses a clock strobed, decoded clock for each internal register.
The clock code is accessed from ROM or PROM one clock cycle before use and
is retimed with the clock for use. Retiming should be performed using H or S-TTL
propagation speed registers. The 54S194 meets the qualification with margins
at both leading and trailing edges of the code. The 54175 and 54174 do not provide
sufficient leading edge margin (which is true for EFL also). The 54S174
and 54S175 satisfy the leading and trailing edge cases, however without trailing
edge margin. See Tables 16, 17, and 18 and Figures 7, 8, and 9.

A quasi radial clock distribution is required to minimize external clock skew.
A limit of 10 nanoseconds is imposed on any stub of the radial fanout. The
following limits are established for skew per item per stub.

(1) maximum 54S140 skew 4 nanoseconds

(2) maximum transmission line 2 nanoseconds

(3) 4 - 30 pF loads 4 nanoseconds
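
The per-item limits above add up exactly to the 10 nanosecond stub limit. A short check, with illustrative dictionary keys that are not part of the specification, makes the budget explicit:

```python
STUB_SKEW_LIMIT_NS = 10

# Per-item skew limits for one stub of the radial fanout (nanoseconds).
skew_items_ns = {
    "54S140 driver": 4,
    "transmission line": 2,
    "4 - 30 pF loads": 4,
}

total_skew_ns = sum(skew_items_ns.values())
within_limit = total_skew_ns <= STUB_SKEW_LIMIT_NS
print(total_skew_ns, within_limit)
```

The budget leaves no slack: any item exceeding its own limit pushes the stub past 10 nanoseconds.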

The clocking of MSI registers and RAM's in the transfer elements is
performed once in 240, 360, or 480 nanoseconds by 60 nanosecond slots produced
by a ring counter with 60, 120, 180, 240, 300, 360, 420, and 480 nanosecond
taps and a recirculate period of 480 nanoseconds.
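
The tap positions and the number of 60 nanosecond slots spanned by each transfer period follow directly from the figures in this paragraph. The sketch below only restates that arithmetic; the helper name is hypothetical.

```python
SLOT_NS = 60          # width of one ring counter slot
RECIRCULATE_NS = 480  # ring counter recirculate period

# Tap positions within one recirculate period.
taps_ns = list(range(SLOT_NS, RECIRCULATE_NS + 1, SLOT_NS))

def slots_per_transfer(transfer_ns):
    """Number of 60 ns slots spanned by a transfer period."""
    if transfer_ns % SLOT_NS:
        raise ValueError("transfer period must be a multiple of the slot width")
    return transfer_ns // SLOT_NS

print(taps_ns)
print([slots_per_transfer(t) for t in (240, 360, 480)])
```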
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The  selected  tap  using  54S194's  represents  a clock  offset 
maximum  of  20  nanoseconds  using  54128  clock  drivers  for  unloading.  The 
criteria  for  tap  usage  is  dependent  upon  worst  case  propagation  delays 
from  a source  register  through  a transfer  element  to  a destination  register 
and  the  tap  or  closest  timing  to  a tap  used  to  clock  the  source  register. 

In the 4AU FFT processor, the most common asymmetric clock from the
ring counter has a period of 480 nanoseconds and a pulse width of 60 nanoseconds.

An example implementation for dump X or R to TTL is derived from Tables
14 and 17, and Figure 8. Assume X is clocked by the edge displaced 129 nanoseconds
(92 + 37) from tap 240's leading edge, the maximum worst case. Using
LS-TTL, a minimum of 170 nanoseconds must be allowed for the transfer element, and the
tap 60 leading edge, displaced 540 nanoseconds from 0, would clock in data.

The transfer time allowed is 300 nanoseconds; the worst case propagation including
skew and 20% margin is 299 nanoseconds. The typical propagation time at
25°C ambient is 214 nanoseconds (111 + 103).
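
The arithmetic of this example can be laid out step by step. The sketch below uses only the numbers quoted in the text; reading the 540 nanosecond edge as 480 + 60, and treating 111 and 103 as the typical clock offset and typical transfer time respectively, are assumptions made for illustration.

```python
# All times in nanoseconds, taken from the worked example in the text.
SOURCE_TAP = 240            # tap whose leading edge starts the transfer
DEST_EDGE = 540             # tap 60 leading edge (read here as 480 + 60)
CLOCK_OFFSET_MAX = 92 + 37  # worst-case displacement of the X clock edge
TRANSFER_MAX_LSTTL = 170    # worst-case LS-TTL transfer element
CLOCK_OFFSET_TYP = 111      # assumed split of the quoted (111 + 103)
TRANSFER_TYP = 103

allowed = DEST_EDGE - SOURCE_TAP
worst_case = CLOCK_OFFSET_MAX + TRANSFER_MAX_LSTTL
typical = CLOCK_OFFSET_TYP + TRANSFER_TYP
margin = allowed - worst_case

print(allowed, worst_case, typical, margin)
```

The worst case leaves a margin of one nanosecond against the 300 nanosecond allowance.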

The C1CLKF clock which strobes the configuration register directly is
a gated clock which generally is issued during reset initialization. This
gating is application dependent and will not be treated here. Note that "Enter k"
in Table 15 is the set-up of the configuration register. The C1CLKT
clock is delayed a maximum of 22 and a minimum of 10 nanoseconds before
being applied to the configuration register. Note that the worst case
maximum propagation of configuration set-up using LS-TTL requires 3 clock
periods or 180 nanoseconds. Substituting H-TTL reduces the configuration
set-up to 2 clock periods or 120 nanoseconds.
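
The number of clock periods needed for configuration set-up is the worst-case delay rounded up to a whole number of 60 nanosecond periods. A minimal sketch, using the "Enter k" maximum entries from Table 15 (8 + 70 + 55) and a hypothetical helper name:

```python
import math

CLOCK_PERIOD_NS = 60

def periods_needed(delay_ns, period_ns=CLOCK_PERIOD_NS):
    """Whole clock periods required to cover a propagation delay."""
    return math.ceil(delay_ns / period_ns)

# Table 15, "Enter k", worst case with LS-TTL:
# external clock skew + 54LS295 configuration + configuration set-up (k).
enter_k_max_ns = 8 + 70 + 55

periods = periods_needed(enter_k_max_ns)
print(enter_k_max_ns, periods, periods * CLOCK_PERIOD_NS)
```

Any substitution that brings the total to 120 nanoseconds or less fits in two periods, which is the H-TTL case described above; the H-TTL register delay itself is not given here.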
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TABLE  14 


Transfer Elements, Internal (times in nanoseconds)

Element                                     alloc   max   typ   L

X or R to P (RTBT Static)
  R Reg CML to EFL                                   30    20   1
  O.C. RX Bus Driver                                 25    15
  O.C. Bus                                           30    15   1
  P Input Multiplexer                                30    19
  P Set-up                                           15    15
  Total                                       240   130    84

R to P RTBT (1-0)
  External Clock Skew                                 8     5
  54S174 (Control)                                   17    12   -
  External Propagation                                8     3   0
  RTBT Inverter                                      22    15   1
  O.C. RX Bus Driver                                 25    15   -
  O.C. Bus (to IV)                                   30    15   1
  P Input Multiplexer                                30    19
  P Set-up                                           15    15
  Total                                       240   155    99

X to P RTBT (0-1)
  External Clock Skew                                 8     5
  54 and 174 (Control)                               12     8
  External Propagation                                8     3
  RTBT (Double-Inverter)                             35    22
  O.C. RX Bus Driver                                 25    15
  O.C. Bus (to IV)                                   30    15
  P Input Multiplexer                                30    19
  P Set-up                                           15    15
  Total                                       240   163   102
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Transfer  Elements,  External 


(times in nanoseconds)

Element                                     alloc   max   typ   L

X or R to RAM
  X Register (Includes CML to EFL)                   30    20   1
  O.C. RX Bus Driver                                 15     5   0
  O.C. Bus (to IV)                                   15    10   0
  Scale Multiplexer                                  15    10   0
  Tristate Driver                                    30    15   1
  Output (point to point) Propagate (1.5V)           12     8   1
  External Data Propagate                            10     5   1
  External Clock Skew                                10     5   -
  Wait for Clock (Margin)                            43   102   -
  RAM Write Time                                     60    60
  Total                                       240   240   240

Dump X or R to T,
  X Register (includes CML to EFL)                   30    20   0
  O.C. RX Bus Driver                                 25    15   1
  O.C. Bus (to IV)                                   30    15   1
  Scale Multiplexer                                  15    10   1
  Tristate Driver                                    23    10   0
  Output (point to point) Propagate (1.5V)           12     8   0
  Reset Override Gate                                30    10   0
  Set-up D, T, A, S                                  36    22   0
  External Clock Skew                                10     5
  Wait for Clock (Margin)                            29   125
  Clock Delay (Margin)                               40    30
  Hold (60 Minimum)                                  80   120
  Total                                       360   360   360

Low Speed LOAD D, T, S
  External Clock Skew                                10     5
  54LS174 Data Register                              30    20
  54L157 Data Multiplexer                            54    36
  Reset Override Gate                                30    10
  Set-up D, T, A, S                                  36    22
  Wait for Clock (Margin)                     180    40    87
  Clock Delay                                        40    30
  Hold (60 Minimum)                           120    69    90
  Total                                       300   300   300

High Speed LOAD D, T, S
  External Clock Skew                                10     5
  54S174 Data Register                               17    12
  Reset Override Gate                                30    10
  Set-up D, T, A, S                                  36    22
  Wait for Clock (Margin)                     120    27    71
  Clock Delay                                        40    30
  Hold (60 Minimum)                           120    80    90
  Total                                       240   240   240

Dump X or R to TTL
  X Register (Includes CML to EFL)                   30    20   0
  O.C. RX Bus Driver                                 25    15   1
  O.C. Bus (to IV)                                   30    15   1
  Scale Multiplexer                                  10    10   1
  Tristate Driver                                    23    10   0
  Output (point to point) Propagate (1.5V)           12     8   0
  Set-up 54LS174                                     30    20   0
  External Clock Skew                                10     5   -
  Wait for Clock (Margin)                            70   137
  Total                                       240   240   240
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TABLE  15 


Process Elements (times in nanoseconds)

Element                                     alloc   max   typ   L

Load A and Add or Load P and Add
  P Register Including CML to EFL                    30    20   1
  Add/Subtract Multiplexer                           15    10   0
  Adder Driver                                       25    15   1
  Add Propagation                                   112    94   1
  CML X Register Set-up                              15    15   0
  Internal Clock Skew                                10     2   -
  Total                                       240   207   156

Load S and Add
  S Register                                         60    35   1
  Adder Driver                                       15     7   0
  Add Propagation                                   112    94   0
  CML X Register Set-up                              15    15   1
  Internal Clock Skew                                10     2   -
  Total                                       240   212   153

Change Add to Subtract
  External Clock Skew                                 8     5   -
  54S174 Control Register (CADDT)                    17    12   0
  Signal Propagation                                  3     1   0
  AU Control Set-up                                  35    22   1
  t Multiplexer                                      30    19   -
  Adder Driver                                       15     7   -
  Add Propagation                                   112    94   -
  CML X Register Set-up                              15    15   -
  Total                                       240   235   175

Enter k
  C2CLK                                               0     0
  External Clock Skew                                 8     5   -
  54LS295 Configuration (Tristate)                   70    47   0
  Configuration Set-up (k)                           55    35   0
  Total                                       180   133    87

Change k
  C1CLK + Q                                           0     0
  Configuration to mux                               30    20   0
  Inverter mux Control                               22    15   1
  K Multiplexer                                      15    10   0
  Product Driver                                     25    15   1
  Product Propagation                               168   140   1
  CML R Register Set-up                              15    15
  Internal Clock Skew                                10     2
  Total                                       360   285   217

Load D or T and Multiply
  D Register                                         60    35   1
  Product Driver                                     25    15   1
  Product Propagation                               168   140   1
  CML R Register Set-up                              15    15   1
  Internal Clock Skew                                10     2   -
  Total                                       360   278   207
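
The timing margin of each process-element is the allocated time less the worst-case total. The sketch below tabulates those margins from the Table 15 totals; the dictionary layout and variable names are illustrative, not part of the specification.

```python
# (allocated, worst-case max, typical) totals from Table 15, in nanoseconds.
process_elements = {
    "Load A/P and Add": (240, 207, 156),
    "Load S and Add": (240, 212, 153),
    "Change Add to Subtract": (240, 235, 175),
    "Enter k": (180, 133, 87),
    "Change k": (360, 285, 217),
    "Load D or T and Multiply": (360, 278, 207),
}

# Worst-case margin: allocated time minus the maximum total.
margins_ns = {name: alloc - worst
              for name, (alloc, worst, _typ) in process_elements.items()}

tightest = min(margins_ns, key=margins_ns.get)
print(margins_ns)
print("tightest:", tightest, margins_ns[tightest])
```

On these figures the change from add to subtract is the tightest element, with 5 nanoseconds of worst-case margin.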
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TABLE  16 


MINIMUM CLOCK STROBE & MAXIMUM CLOCK CODE DELAYS (times in nanoseconds)

LEADING EDGE                       CLOCK t min   CODE t max   L
  Delay to 0° Phase Clock                   60
  54S140 Driver Delay                        6
  54194 Delay                                            17
  AU (C2CLKF) Clock Inv.                    10                (0-1)
  AU Clock Decoder                                       39
  TOTAL                                     73           56
  Minus Code Total                         -56
  Difference Margin                         17

TRAILING EDGE
  Code Width                                            120
  Delay to 0° Clock                         60
  54S140 Driver Delay                        6
  54194 Delay                                            17
  AU Clock Inverter                          7                (1-0)
  AU Clock Decoder                                       39
  Clock Width                               60
  TOTAL                                    133          176
  Minus Clock Total                                    -133
  Difference Margin                                      43
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TABLE  17 

MAXIMUM CLOCK STROBE & MINIMUM CLOCK CODE DELAYS (times in nanoseconds)

LEADING EDGE                       CLOCK t max   CODE t min   L
  Delay to 0° Phase Clock                   60
  54S140 Driver Delay                       10
  54194 Delay                                             8
  AU (C2CLKF) Clock Inv.                    22                (0-1)
  AU Clock Decoder                                       20
  TOTAL                                     92           28
  Minus Code Total                         -28
  Difference Margin                         64

TRAILING EDGE
  Code Width                                            120
  Delay to 0° Clock                         60
  54S140 Driver Delay                       10
  54194 Delay                                             8
  AU Clock Inverter                         15                (1-0)
  AU Clock Decoder                                       20
  Clock Width                               60
  TOTAL                                    145          148
  Minus Clock Total                                    -145
  Difference Margin                                       3
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TABLE  18 


TYPICAL CLOCK STROBE-CLOCK CODE DELAYS (times in nanoseconds)

LEADING EDGE                       CLOCK t min   CODE t max   L
  Delay to 0° Phase Clock                   60
  54S140 Driver Delay                        8
  54194 Delay                                            15
  AU (C2CLKF) Clock Inv.                    15                (0-1)
  AU Clock Decoder                                       30
  TOTAL                                     83           45
  Minus Code Total                         -45
  Difference Margin                         38

TRAILING EDGE
  Code Width                                            120
  Delay to 0° Clock                         60
  54S140 Driver Delay                        8
  54194 Delay                                            15
  AU Clock Inverter                         10                (1-0)
  AU Clock Decoder                                       20
  Clock Width                               60
  TOTAL                                    138          155
  Minus Clock Total                                    -138
  Difference Margin                                      17
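
The leading and trailing edge margins in Tables 16, 17, and 18 are differences between the clock and code totals: at the leading edge the code must settle before the clock strobe arrives, and at the trailing edge the code must outlast the strobe. The sketch below recomputes the margins from the tabulated totals; the function and dictionary names are hypothetical.

```python
def edge_margins(clock_lead, code_lead, clock_trail, code_trail):
    """Return (leading margin, trailing margin) in nanoseconds.

    Leading:  clock strobe arrival minus code arrival.
    Trailing: code removal minus end of the clock strobe.
    """
    return clock_lead - code_lead, code_trail - clock_trail

# (clock leading, code leading, clock trailing, code trailing) totals.
totals = {
    "Table 16": (73, 56, 133, 176),
    "Table 17": (92, 28, 145, 148),
    "Table 18": (83, 45, 138, 155),
}

for table, t in totals.items():
    print(table, edge_margins(*t))
```

The smallest figure is the 3 nanosecond trailing edge margin of Table 17.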
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4.2.2  Register  Clearing 

The zero override, RSTF, forces the interpretation of input data to a
logical zero at all input registers. The A, T, D, P from d inputs, S, and
C registers constitute the resettable set. A clock code in conjunction with
a clock generates one clock to these registers, and the register(s) designated
by the clock code are cleared. The codes provided are A, T, D, P, or S alone
and T and D or P and S together. The zero override also forces the +2047/2048
override feeding the A and T registers to zero.

The initialization clear is held through one microprogram cycle,
propagating zeroes through the pipeline at start-up. The programmed
clear is used to reset an accumulator after n iterations and is
introduced at the inputs during the dump output transfer element. The programmed
clear should be applied 60 nanoseconds prior to the clock period and follow
that period for 60 nanoseconds to reset a particular register or input
register pair. The configuration, C, register may be independently reset with the
application of the C1CLK and RSTF. The set-up and hold times given previously
also apply to the C register.

4.2.3  Register  + 1 Data  Override 

When the function T*D+kA is to be used as (+1*D+kA), the Z configuration
flip-flop overrides the t(sn - 11) input with +1, 2047/2048; and the T and D
register clocks may be generated to load the multiplier with the addend or
subtrahend into D simultaneously with +1 into T. The Z configuration flip-flop
must be returned to its original state by a high input at S09 and a C1CLK clock.

4.2.4  The  -1/2048  Default  Input  Value 

The t(sn - 11), d(sn - 11), and s(sn - 11) data inputs are pulled high
(5.0V) by a pullup resistor at the input to the TTL buffers when the inputs are
open. This eliminates the external provisioning of pullup resistors provided
in high speed system design on unused inputs and permits selective grounding
of s inputs for constant configuration when those data inputs are not used.
It also provides an opposing data pattern to reset override for dynamic testing.
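
The -1/2048 default and the +2047/2048 override both follow from reading a 12 bit word as a two's complement fraction with a sign bit and 11 fractional bits. The sketch below assumes that interpretation; the function name is hypothetical.

```python
def q11_value(word):
    """Interpret a 12 bit two's complement word as a signed fraction.

    Bit 11 is the sign; the remaining 11 bits are fractional,
    so one least significant bit is worth 1/2048.
    """
    word &= 0xFFF
    signed = word - 0x1000 if word & 0x800 else word
    return signed / 2048

all_inputs_high = q11_value(0xFFF)   # open inputs pulled high
plus_one_override = q11_value(0x7FF) # largest positive word

print(all_inputs_high, plus_one_override)
```

With every input pulled high the word is all ones, which reads as -1/2048; the largest positive word reads as 2047/2048, the value used to stand in for +1.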




4.2.5  Product  Significance 

The product of the two values held in the T and D registers is truncated
or rounded to the 12 most significant bits including sign, depending on the state
of the round, r, configuration flip-flop. The round carry is inserted into the
13th bit (2^-12). The 11 least significant bits of the product may be accessed
by programming the TXL clock code after loading the multiplier and multiplicand.
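
Truncation and rounding of the product can be illustrated with integer arithmetic. This is a sketch under stated assumptions: operands are treated as signed 12 bit integers with 11 fractional bits, the product therefore carries 22 fractional bits, overflow of the single case -1 * -1 is not handled, and the function name is hypothetical.

```python
def product_msbs(t, d, round_enabled):
    """Return (12 most significant bits, 11 least significant bits).

    t and d are signed integers in the range -2048..2047 representing
    fractions t/2048 and d/2048. The full product has 22 fractional
    bits; the result keeps 11 of them.
    """
    product = t * d
    low_bits = product & 0x7FF           # the 11 bits dropped by the result
    if round_enabled:
        product += 1 << 10               # round carry into the 2^-12 bit
    return product >> 11, low_bits       # arithmetic shift keeps the sign

print(product_msbs(1024, 1024, False))   # 0.5 * 0.5
print(product_msbs(3, 1024, False))      # truncated
print(product_msbs(3, 1024, True))       # rounded
```

The last two calls differ only in the round setting: the dropped half least significant bit is discarded in the first and carried into the result in the second.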
The SPAU is designed to provide a high utility in throughput
when applied as a continuous function to a data stream. It is designed
to be electrically compatible with any one of the five TTL families, L-TTL,
LS-TTL, TTL, H-TTL, and S-TTL, directly interfaced. The SPAU may
tie onto a tristate bus or be configured into a direct point-to-point logic
network. The internal functions and some interconnect controls are
configurable through an internal configuration holding register. The remainder are
classified as high rate functions and interconnect controls, and are serviced
directly from the microprogram control. The clocking of events within the
arithmetic unit is controlled by a clock code obtained directly from
microprogram control and strobed by a free running, easy to use, symmetrical low
frequency (8.33 MHz maximum) clock.

The  functional  features  include  a 12  X 12  bit  multiplier,  two  twelve 
bit  adders  with  two's  complement  add/subtract  control,  product  truncation  or 
rounding  to  the  twelve  most  significant  bits,  two  12  bit  accumulator  registers, 
four  12  bit  operand  holding  registers  with  synchronous  programmed  reset,  and 
a 12  bit  post  operative  scaler.  An  internal  bus  links  the  product  and  adders, 
provides  a fast  accumulator,  and  satisfies  a high  speed  FFT  quarter  butterfly 
or  low  speed  half  butterfly  capability. 


4.1  Signal  Processing  Architecture 


The arithmetic unit SPAU is incorporated in a complex arithmetic
microprocessor illustrated in Figure 10. A minimum of two complex arithmetic
microprocessors is needed to configure a sequential mode FFT butterfly, Figure
11, where each SPAU serves as a "half-butterfly". The most efficient pipelined
FFT butterfly is depicted in Figure 12. It is configured from two complex
arithmetic microprocessors each containing two arithmetic unit LSI's, each
serving as a "quarter-butterfly".


4.2.1  Register  Clocking 

The arithmetic SPAU contains input and output holding registers as
shown in Figure 1. Data is loaded into these registers, in the absence of
a reset or preset override, when one or more of the 7 clock decoder gates is enabled,
permitting the free running clock to pass through to the register or registers.
Clocks are enabled according to Table 4 from clock code inputs QCLK0T to
QCLK3T.
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4.3  TTL  Family  Compatibility 

The signal processing arithmetic unit (SPAU) is designed to be compatible
within the specified limits of the five families of TTL. Mixing of the
families in a design requires the adoption of design rules. The SPAU
compatibility is not without reservation, however. SPAU inputs, for example, require
from 4 to 8X the normal (40 µA) source current of TTL for VIH and a reduced
sink current for VIL. A 50 kΩ base resistor at VIL typically provides 20 µA
current for the input TTL emitter follower buffer and 80 µA source current
into the TTL output sink at the data inputs. Control inputs require a larger
source current. The SPAU input fan-in is limited by the sourcing capability
of TTL.

L and LS-TTL have low noise margins and consequently should not be mixed
into a system utilizing a lot of noisy and high noise margin S- or H-TTL.
It is expected that the low noise generated by L or LS-TTL is more compatible
with the internal CML (Current Mode Logic) within the SPAU. An LSI partition
of an existing system should translate as many MSI, S and H-TTL to L and LS-TTL
as possible (except in the clock code area). The L and LS-TTL families have 200 and 300
millivolt noise immunities for VIL at +125°C respectively. The SPAU output
source and sink capability is in keeping with the L and LS-TTL fan-in
requirements.

The VOL maximum of .3V is under L and LS-TTL load conditions to a
fanout of four.

Table 19 is a representative example of cross TTL compatibility at a
single gate level. It is only useful in component selection in implementing
a preliminary LSI partition.

All a.c. loads are rated at a maximum of 50 picofarads, which translates
to a maximum "daisy chain" of six single stations (clocks and some MSI to three
double stations) and ten picofarads of interconnect. The propagation times
for these a.c. loads are listed in Table 19 over temperature. These times are
intended for use in worst case over temperature designs and apply to single
positive NAND gate structure delays from 1.5V input to 1.5V output. As a
design practice, establish the propagation delay at 50 pF for those devices
selected and use this in delay averaging without permitting the a.c. load to
exceed 50 pF. Budget 7 picofarads per input load (a Dual J-K has a 4 clock load
of 28 pF), 10 picofarads per foot for #30 wire wrap, and 3 picofarads per inch for
28 mil laminants and interconnect logic. The fanouts given in Table 19 assume
3" of interconnect on a multilaminant board. The 50 picofarad load is a
maximum a.c. fanout for the arithmetic unit TTL tristate outputs. A maximum of
15 picofarads is assumed for each "off" tristate driver on a tristate bus.
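
The capacitance budgeting rule lends itself to a small calculator. The sketch below applies only the per-input and per-length figures quoted in this paragraph; the function name and argument names are hypothetical.

```python
MAX_AC_LOAD_PF = 50          # rated maximum a.c. load
PF_PER_INPUT = 7             # budget per input load
PF_PER_FOOT_WIRE_WRAP = 10   # #30 wire wrap
PF_PER_INCH_LAMINATE = 3     # 28 mil laminate interconnect

def ac_load_pf(input_loads, wire_wrap_ft=0, laminate_in=0):
    """Budgeted a.c. load in picofarads for one driven net."""
    return (input_loads * PF_PER_INPUT
            + wire_wrap_ft * PF_PER_FOOT_WIRE_WRAP
            + laminate_in * PF_PER_INCH_LAMINATE)

# Dual J-K clock input: 4 clock loads, as in the text.
jk_clock = ac_load_pf(4)
# The same load with 3 inches of laminate interconnect.
jk_with_trace = ac_load_pf(4, laminate_in=3)

print(jk_clock, jk_with_trace, jk_with_trace <= MAX_AC_LOAD_PF)
```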
A high speed register should not fan out to a low speed register if
both are clocked by the same clock, as a general rule, since the switching
time of the high speed register may be many times less than the data hold
time of the low speed register. The L and LS-TTL MSI registers are "slow"
registers. When a high speed register fans out to a low speed register, the
design must guarantee that the minimum delay through logic between fast and
slow exceeds, in all cases, the hold time plus maximum clock skew of and to
the "slow" register. The SPAU LSI has a mix of high and low
speed registers in it; this was done for reasons of area and power economy.

In the process of micro-coding the arithmetic unit clock codes, allow
sufficient hold time for the arithmetic unit A, D, S, and T registers; and
under no circumstances allow these to be clocked simultaneously with a
source register in a transfer-element where they are a destination register.
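
The fast-to-slow register rule reduces to a single inequality: the minimum logic delay must exceed the hold time plus the maximum clock skew. A minimal sketch follows; the function name is hypothetical, the 60 nanosecond hold and 10 nanosecond skew are the Table 14 figures, and the two logic delays are invented purely for illustration.

```python
def hold_is_safe(min_logic_delay_ns, hold_ns, max_clock_skew_ns):
    """True when data cannot change at the slow register before its
    hold time, including worst-case clock skew, has elapsed."""
    return min_logic_delay_ns > hold_ns + max_clock_skew_ns

HOLD_NS = 60   # "Hold (60 Minimum)" from Table 14
SKEW_NS = 10   # external clock skew, worst case, from Table 14

# Hypothetical minimum path delays between a fast and a slow register.
print(hold_is_safe(75, HOLD_NS, SKEW_NS))  # enough intervening logic
print(hold_is_safe(20, HOLD_NS, SKEW_NS))  # fast register drives slow one directly
```

When the check fails, the usual remedies implied by the text are to clock the two registers on different clock codes or to add delay between them.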

The clock strobing the clock decoder in the arithmetic unit is
continuous, which demands a "no-clock" code for holding registers during a clock.
The clock codes implemented enable the clock gate for any single
arithmetic unit register, groups of registers, or no register (all held). The
sequence of clock codes, comprising the microprogram, enables clock pulse
"wells" to latch input data streams into the A, D, S, and T registers and
enables clock edges to capture output result data into the edge triggered
P, R, and X registers. The code sequence of the microprogram is derived from
the timing of transfer-elements and process-elements, given in section 3.0
of this specification, to accomplish a particular function. One clock code
is required per clock period and the minimum clock period is established at
120 nanoseconds for worst case designs.

A twenty percent increase in throughput can be achieved for systems
requiring best commercial practices and operating in a controlled 25°C
environment using the typical values given in section 3 of this specification.
The structure of the arithmetic unit LSI does not preclude its increased
capability in a relaxed thermal environment.
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Output  source  current  Into  ground  short 


APPENDIX  B 


DETAILED LAYOUT AND CIRCUIT DESIGN OF SPAU 1

Due to its complexity and the many disjoint functions incorporated into
the SPAU 1 design, the layout of this chip represented a formidable challenge.
The basic technology employed, the triple diffusion process, also has a
restriction to use only a single level of metallization for both power and
signal bus. Although TRW maintains and is developing a CAD effort in LSI,
it was clear that none of these were sufficiently advanced to handle a
problem of the magnitude offered by the SPAU 1. The SPAU 1 called for a custom
layout design; however, certain standardizations were employed which reduced
the manual task, and maximum use was made of computer-designer interactive
aids, the Applicon system.

Standardization was used in the size of the circuit cell and the power
bus interface with these cells. The circuit cell is rectangular with signal
entry and exit on all four sides. This facilitated the two dimensional
multiplier block of logic, the largest electronic segment on the chip. Figure
2 shows the SPAU Layout Plan and the circuit cell dimensions used. The
overall chip dimensions are 315 by 351 mils. In the list of figures included
in Appendix B, the schematic diagrams of all circuit cells are shown. Also
shown in each cell is the approximate placement for all input and output
signals and any routing of control lines through the cell.

The type of logic circuits used can be determined from the schematic
diagrams. All inputs and outputs are TTL logic compatible, but not necessarily
identical, as can be determined from a review of the SPAU Specification.
EFL (emitter follower logic) is most often used, but other logic forms are used
where there are circuit reasons for doing so. EFL is used in the D and T
registers and extensively in the multiplier array. The multiplier array uses
non-threshold AND-OR gates without level restoration through the entire array.
For each one bit product point in the array both true and complement output
[illegible] for both carry and sum outputs are fully implemented (so called dual
rail). [illegible] the final dual rail outputs are compared via a differential
amplifier [illegible] quite fortuitously into CML (current mode logic) logic. The
[illegible] is consequently implemented with CML circuits and level
[illegible] levels by a Norton mirror circuit (see Cell R and Cell X).
In the figures of the circuit cells, the number of devices used (a device
is a transistor, resistor, or diode) and the average power consumption in
milliwatts are shown at the upper right hand corner. The spatial distribution of power
is shown in Figure 3 and the device count in Figure 4. These are
respectively 5 watts (nominal, 25°C) and 13,796 devices for the SPAU. Power
bus current, thermal interface, diffusion parameters, and package connections
are shown in Figures 5 through 9, respectively. This is followed
literally by Cells A through Z in the figures of this Appendix.
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LIST  OF  FIGURES  - SPAU  1.  APPENDIX  B 


FIGURE  TITLE 

1 LSI  Implementation  Arithmetic  Unit  Logic  Schematic 

2 SPAU  Layout  Plan  12  Bit  Word  Length 

3 Power  Distribution  12  Wide  SPAU 

4 Device Count 12 Bit Wide SPAU

5 VCC2  Bus  Current  Distribution 

6 SPAU  Thermal  Interface 

7 Diffusion  Parameters  for  SPAU 

8 SPAU/12  Package  Connections 

9 CELLA 

10  CELLAA 
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12  CELLAD 

13  CELLADS 

14  CELLAU 
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19  CELLCDB 

20  CELLCX 
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30  CELLFX 

31  CELLFXC 

32  CELLKMU 
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APPENDIX  C 

Model: TDC1003J


The TDC1003J is a multifunction arithmetic unit
capable of performing 12 x 12 multiplication as well
as product accumulation. It has an additional feature
of permitting the accumulator contents to be
subtracted from the next product instead of being added,
if desired. Input registers are provided in addition
to the product accumulation register.

The  TDC1003J  is  directly  implementable  as  the 
central  building  block  for  digital  filters,  particularly 
FFTs,  for  complex  multipliers,  and  for  recursive  and 
nonrecursive  filter  elements. 


FEATURES 

• 12x12  Bit  Parallel,  Two's  Complement 
Multiplication 

• Controllable  Accumulation  Either  + or  — 

• 175 nsec Typical Multiply and Accumulate Time

• Much Lower Power/Faster Speed than
Equivalent MSI Multiplication-Accumulation Systems

• Round  Control 

• 27-Bit  Accumulation  Capacity 

• Single  Chip,  Bipolar  Technology 

• Asynchronous  Mode  Multiply 

• Radiation  Hard 

• TTL  Input  and  Output 

• Three  State  Outputs 

• Single  Power  Supply,  +5  Volts 

• Dual  In-Line  Package  or  Flat  Pack 

• 2.5  Watts  Power  Consumption 


MULTIPLIER-ACCUMULATOR 
PARALLEL  12-BIT 


PRELIMINARY 
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absolute  maximum  rating*  over  operating  temperature  range 


Supply voltage                    -0.6 to 7.0 V

Input voltage                     0 to 6.6 V

Output voltage                    0 to 6.6 V

Operating temperature range       0°C to 70°C

Storage temperature range         -66°C to 160°C

Lead temperature (10 seconds)     300°C

Junction temperature              175°C


recommended operating conditions

                                                       TDC1003
                                                  MIN    NOM    MAX    UNIT

Supply voltage                                    4.6    5.0    6.6    V

Clock pulse width (measured at 1.6 V level)       26                   ns

Input register setup time, tS (see Figure 1)      6                    ns

Input register hold time, tH (see Figure 1)       15                   ns

Operating ambient temperature                     0             70     °C

electrical  characteristics  over  recommended  temperature  range 


PARAMETER 

tcst  rnisniTimn 

TDC1003 

UNIT 

MIN 

TYP 

MAX 

V.H 

High-level  Input  voltage 

20 

V 

VIL 

Low  level  Input  voltag* 

0A 

V 

VOH 

High-level  output  voltage 

V(X  ■ NOM, 

'oh  " mA 

2.4 

2.7 

V 

o 

> 

Low  lewl  output  voltae* 

Vcc  - MIN. 

lOL  - 4.0  mA 

OJ 

0.6 

V 

•in 

High  Iwd  Input  current 

Vcc-MAX. 

V.H-2A 

-2 

78 

mA 

>IL 

Low  level  input  current 

Vcc  ’ MAX- 

v.L-a* 

-6 

-76 

mA 

'in 

Clock*1 

Vcc  - MAX. 

V,h-2.4 

78 

mA 

'IL 

Clock** 

V^.  - MAX. 

v,l-m 

-0.78 

mA 

'cc 

Supply  current 

Vcc-nom 

BOO 

760 

mA 

At  T h|||1|  - »°C.  Va  - NOM.  * Clock  Me  two  equlvelent  dock  Input  load*. 

switching characteristics, VCC = 5.0 V, TA = 25°C (see Figure 1)


PARAMETER

TEST CONDITIONS

MIN

TYP

MAX

UNIT

Multiply accumulate time, input
register clock to output register clock

See Figure 6

160

176

ns

tD

Output delay

Load 1, see Figures 3, 6

40

60

ns

Three-state output delay

Output enable

Load 2, see Figures 4, 6

40

50

ns

Output disable

Load 2, see Figures 4, 6

30

40

ns
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Figure  1.  Timing  Diagram 


Figure  2.  Input  Schematics 
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Figure  3.  Output  Delay  Versus  Temperature 


Figure  4.  Three  State  Delay  Versus  Temperature 
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Figure  5.  Multiply  and  Accumulate  Time  Versus  Temperature 


OUTPUT LOAD CIRCUITS

LOAD 1: NORMAL LOAD (to output pin)

LOAD 2: THREE-STATE DELAY LOAD (to output pin; diode type 1N3062)
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CONTROLS  DESCRIPTION:  TDC1003J 


ACC, SUB, and RND are loaded into registers by either CLKX or CLKY.

ACC: When the ACC signal is low, the next product clocked into the MSP and LSP accumulating registers has
zero added to it, i.e., it is the first product in a series to be summed. The ACC signal is then brought high.
Subsequent products are then accumulated in the product registers. If accumulation is not desired, ACC
is held low; the TDC1003J then functions as a standard multiplier.

SUB: When the SUB signal is high, the accumulator contents are subtracted from the next product and the
difference is then stored in the output registers. When low, the accumulator contents are added to the next
product (straight accumulation). The SUB control is enabled by ACC: SUB' = (SUB · ACC).

RND: When RND is high, the quantity 2^-12 (for fractional 2s complement field, see FORMAT, page 6) is added to
the next product.

TRIM, TRIL: Non-registered three state buffer controls: '0' = enable, '1' = disable.
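
The control rules above can be summarized in a small behavioral sketch. This is an illustrative model only, not a description of the chip's internal logic: the function name `mac_step` is hypothetical, operands are taken as signed integers, the rounding constant assumes a product with 22 fractional bits (so that 2^-12 is bit 10), and the 27-bit accumulator width is not modeled.

```python
def mac_step(acc_reg, x, y, acc=1, sub=0, rnd=0):
    """Return the new output-register contents after one clocked operation.

    acc_reg : current accumulator contents (integer)
    x, y    : 12-bit two's complement operands, -2048..2047
    acc, sub, rnd : control bits as described in the text
    """
    if not (-2048 <= x <= 2047 and -2048 <= y <= 2047):
        raise ValueError("operands must fit in 12-bit two's complement")
    product = x * y
    if rnd:
        product += 1 << 10          # 2**-12 when the product has 22 fractional bits
    if not acc:                     # ACC low: zero is added, plain multiplier
        return product
    if sub:                         # SUB' = SUB AND ACC: product minus accumulator
        return product - acc_reg
    return product + acc_reg        # straight accumulation


# Sum of products: the first product is loaded with ACC = 0, the rest with ACC = 1.
total = mac_step(0, 3, 4, acc=0)
total = mac_step(total, 5, 6, acc=1)
```

Because SUB is gated by ACC, setting SUB while ACC is low has no effect in this model, which mirrors the SUB' = (SUB · ACC) relation.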


TYPICAL  OPERATING  SEQUENCE 


SUM  OF  PRODUCTS 


Timing diagram signals: CLKX, CLKY, X or Y input, ACC, CLKP.


Load X1, Y1 words into input registers. Also, load ACC = 0 into control register.


Load X1 · Y1 product into output registers. Also, load X2, Y2 into input registers.
Also, load ACC = 1 into control register.


X2 · Y2 product added to X1 · Y1 and stored in output registers. Also, load X3, Y3 into
input registers. Also, load ACC = 1 into control register.


NOTES: 1. SUB = 0 for sequence above.

2. Xn, Yn are 12-Bit Two's Complement Numbers.
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X3 · Y3 product added to [(X2 Y2) + (X1 Y1)] and stored in output registers.
Also, load X4, Y4 into input registers, etc.
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FORMAT: 
(NOTE  S) 


2's COMPLEMENT FRACTIONAL NOTATION


INPUT   X_SGN  X_-1  X_-2  X_-3  X_-4  X_-5  X_-6  X_-7  X_-8  X_-9  X_-10  X_-11

WEIGHT  SGN    2^-1  2^-2  2^-3  2^-4  2^-5  2^-6  2^-7  2^-8  2^-9  2^-10  2^-11
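
As a sketch of how the fractional input format can be handled in software, the following converts between a 12-bit word and its fractional value. The function names are hypothetical, and truncation toward minus infinity is an assumption made here for the quantizer, not something stated in the data sheet.

```python
import math


def to_fraction(word):
    """Interpret a 12-bit word (0..4095) as a two's complement fraction in [-1, 1)."""
    if not 0 <= word <= 0xFFF:
        raise ValueError("word must be a 12-bit value")
    if word & 0x800:                # sign bit set: value is negative
        word -= 0x1000
    return word / 2048.0            # least significant bit weighs 2**-11


def from_fraction(value):
    """Quantize a value in [-1, 1) to a 12-bit word (assumed truncation)."""
    if not -1.0 <= value < 1.0:
        raise ValueError("value outside the fractional number field")
    return int(math.floor(value * 2048)) & 0xFFF
```

Note that +1 itself has no code in this field, while -1 (word 0x800) does; the FFT notes later in the data sheet rely on this property.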
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TYPICAL  APPLICATIONS  OF  TDCI003J 


DIGITAL  FILTER  FOR  NUMERICAL  INTEGRATION  OF  SAMPLED  DATA 

Based  on  the  area  under  a parabolic  arc,  Simpson's  rule  is  an  accepted  method  for  numerical  integration  of  a sampled  data 
sequence. 

Let a function y(t) be sampled at n + 1 points, such that y_0, ..., y_n are equally spaced at an incremental interval T.

Assume that n is even. Then according to Simpson's rule (see almost any calculus textbook), the area A under the curve
y(t), given by

$$A = \int_{t_0}^{t_n} y(t)\,dt, \qquad (1)$$

may be approximated by

$$A_S = \frac{T}{3}\left(y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + \cdots + 4y_{n-1} + y_n\right). \qquad (2)$$

This is generally more accurate than the so-called trapezoidal rule

$$A_T = \frac{T}{2}\left(y_0 + 2y_1 + 2y_2 + \cdots + 2y_{n-1} + y_n\right), \qquad (3)$$

which approximates the function y(t) by straight-line segments and therefore fails to take account of curvature.

An accumulation of the terms in Equation (2), therefore, implements Simpson's rule explicitly, where it is necessary only to
input the sequence of sampled points and the appropriate sequence of weighting coefficients. After any step m, where m < n,
the contents of the accumulator are A_m, which is an approximation to the running integral up to that point. When m = n
and the accumulation is terminated with the proper weighting coefficient (see Note), the evaluation is complete and

$$A_n = A_S.$$


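BLOCK DIAGRAM (Uses one TDC1003J)

[Block diagram: the samples y_0, y_1, ... and the weights T/3, 4T/3, 2T/3, ... are the two inputs to the TDC1003J.]


Operation Sequence        Contents of Output Registers At End of Operation

y_0 · T/3                 A_1
A_1 + y_1 · 4T/3          A_2
A_2 + y_2 · 2T/3          A_3
A_3 + y_3 · 4T/3          A_4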


NOTE: To avoid termination error, based on Simpson's rule outlined above, the integration should terminate on an odd
number of samples (n even) with a weight of T/3, as shown. If it is necessary to terminate on an even number of samples
(n odd), then it is a good approximation to keep the sequence up to that point and terminate with a weight of 2T/3.
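
The accumulation described in this application can be sketched as a running sum of weighted samples. The function name `simpson_accumulate` is hypothetical and floating-point arithmetic stands in for the chip's fixed-point fields; only the n-even case with a terminating weight of T/3 is shown.

```python
def simpson_accumulate(samples, T):
    """Accumulate y_m * w_m with weights T/3, 4T/3, 2T/3, ..., 4T/3, T/3."""
    n = len(samples) - 1
    if n < 2 or n % 2:
        raise ValueError("need an odd number of samples (n even)")
    acc = 0.0
    for m, y in enumerate(samples):
        if m == 0 or m == n:
            weight = T / 3          # first and terminating weight
        elif m % 2:
            weight = 4 * T / 3      # odd-numbered samples
        else:
            weight = 2 * T / 3      # interior even-numbered samples
        acc += y * weight           # one multiply-accumulate per sample
    return acc


# y(t) = t**2 sampled on [0, 2] with T = 0.5; Simpson's rule is exact for a parabola.
area = simpson_accumulate([t * t for t in (0.0, 0.5, 1.0, 1.5, 2.0)], 0.5)
```

For this parabola the result agrees with the exact integral 8/3 to within floating-point rounding.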
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NONRECURSIVE  FILTER 


Using one TDC1003J in time sequenced operation.


Operation Sequence

Step  Load        ACC  SUB  RND   Contents of Output Registers                       Cumulative Time
                                  at Completion of Operation                         at Completion
                                                                                     of Operation (ns)

1     Xt, h0                      Xt*h0
2     Xt-1, h1                    Xt*h0 + Xt-1*h1
3     Xt-2, h2                    Xt*h0 + Xt-1*h1 + Xt-2*h2
4     Xt-3, h3                    Xt*h0 + Xt-1*h1 + Xt-2*h2 + Xt-3*h3                800, cycle complete

5     Xt+1, h0                    Xt+1*h0
6     Xt, h1                      Xt+1*h0 + Xt*h1
7     Xt-1, h2                    Xt+1*h0 + Xt*h1 + Xt-1*h2
8     Xt-2, h3                    Xt+1*h0 + Xt*h1 + Xt-1*h2 + Xt-2*h3 = XOUT_t+1     cycle complete
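
A minimal sketch of the time-sequenced operation follows, assuming one product per step with the first product of each cycle loaded without accumulation. The name `fir_output` is hypothetical, and the 200 ns step time used in the usage comment is inferred from the 800 ns entry for a four-tap cycle rather than stated per step.

```python
def fir_output(x, h, t):
    """Return sum over k of x[t-k] * h[k], formed one product per step."""
    if t - (len(h) - 1) < 0 or t >= len(x):
        raise ValueError("not enough samples for this output index")
    acc = 0
    for k, coeff in enumerate(h):
        product = x[t - k] * coeff
        # Step 1 of each cycle clears the running sum (ACC = 0); later steps accumulate.
        acc = product if k == 0 else acc + product
    return acc


samples = [1, 2, 3, 4, 5]
weights = [1, 2, 3, 4]              # h0..h3
first = fir_output(samples, weights, 3)
second = fir_output(samples, weights, 4)
```

With four taps and an assumed 200 ns per step, each output takes 4 × 200 = 800 ns on a single part.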


PIPELINED  NONRECURSIVE  FILTER  USING  FOUR  TDC1003J,  5 MHz 


In this method, four TDC1003J parts are used in a parallel-out kind of connection. An external four-stage circulating shift
register holds the weighting functions, h. A tag bit is also circulated in the h shift registers which operates the 3-state control,
thereby busing the accumulator contents to the output in sequence. Input register clocking, output register clocking and
h shift register are all operated directly from the 5 MHz system clock. The control ACC is also operated from the tag bit as
well as the 3-state control. As can be traced from the block diagram, each TDC1003J accumulates four products and then is
gated to the output bus. On the next clock period the adjacent TDC1003J is outputted, etc. By these means, steady outputs
at the 5 MHz rate are sustained. The latency period is four clock periods or 800 ns. The internal operation sequence for any
one TDC1003J is the same as shown in the previous implementation.
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SINGLE TDC1003J PART IMPLEMENTATION
SIMPLE RECURSIVE FILTER OR SINGLE POLE FILTER

In this case, two-step operation is used where the operand into the multiplier is alternated between 1 and h and the output
is read out on alternate clock cycles. Operation is shown below, assuming initial output accumulator contents of value u.


Operation Sequence

Step              Load           ACC  SUB  RND  TRIM   Contents of Output Registers                        Cumulative Time
                                                       at Completion of Operation                          of Completion
                                                                                                           of Operation

Initial state t0                                       XOUT_t0 = u
1                 XOUT_t0, h     1    0    0    0
2                 x_t+1, 1       1    0    0    1      XOUT_t+1 = x_t+1 + h*u                              400, cycle complete
                                                       (read out to destination)
3                 XOUT_t+1, h    1    0    0    0      h(x_t+1 + h*u)                                      600
4                 x_t+2, 1       1    0    0    1      x_t+2 + h(x_t+1 + h*u)                              800, cycle complete
                                                       (read out to destination)


For a two or more pole filter we have the following diagram:

$$XOUT_{t_0} = u$$

$$XOUT_{t+1} = x_{t+1} + h_1 u$$

$$XOUT_{t+2} = x_{t+2} + h_1 (x_{t+1} + h_1 u) + h_2 u$$

$$XOUT_{t+3} = x_{t+3} + h_1 \left[x_{t+2} + h_1 (x_{t+1} + h_1 u) + h_2 u\right] + h_2 (x_{t+1} + h_1 u)$$

$$XOUT_{t+4} = x_{t+4} + h_1 \left\{x_{t+3} + h_1 \left[x_{t+2} + h_1 (x_{t+1} + h_1 u) + h_2 u\right] + h_2 (x_{t+1} + h_1 u)\right\} + h_2 \left[x_{t+2} + h_1 (x_{t+1} + h_1 u) + h_2 u\right]$$

In general

$$XOUT_t = x_t + h_1\,XOUT_{t-1} + h_2\,XOUT_{t-2} + \cdots + h_n\,XOUT_{t-n}$$

$$XOUT_t = x_t + \sum_{i=1}^{n} h_i\,XOUT_{t-i}$$

Using the methods shown for the other example, the two pole filter can be stepped at 600 ns intervals using one TDC1003.
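
The general recursion can be sketched as follows, with floating-point values standing in for the chip's number field. The function name `recursive_filter` and the `initial` argument (earlier outputs, oldest first, such as the starting accumulator value u) are illustrative assumptions.

```python
def recursive_filter(x, h, initial=()):
    """Compute XOUT_t = x_t + sum_i h[i-1] * XOUT_{t-i} for each input sample.

    x       : input samples in time order
    h       : feedback weights h1..hn
    initial : earlier outputs, oldest first; missing history is taken as zero
    """
    out = list(initial)
    start = len(out)
    for sample in x:
        acc = sample                        # the x_t * 1 product
        for i, coeff in enumerate(h, start=1):
            if i <= len(out):
                acc += coeff * out[-i]      # accumulate h_i * XOUT_{t-i}
        out.append(acc)
    return out[start:]


u = 4.0
single_pole = recursive_filter([1.0, 3.0], [0.5], initial=[u])
two_pole = recursive_filter([1.0, 3.0, 0.0], [0.5, 0.25], initial=[u])
```

For the two-pole case the second output equals x + h1(x' + h1·u) + h2·u, matching the expanded form given above.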
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Operation  Sequence 

Step  Load      ACC  SUB  RND   Contents of Output Registers           Cumulative Time
                                at Completion of Operation             at Completion of Operation (ns)

1     Y1, Y2    0    0    0     Y1*Y2                                  200
2     X1, X2    1    1    0     X1*X2 - Y1*Y2 = X                      400
                                (transfer to destination)
3     X1, Y2    0    0    0     X1*Y2                                  600
4     X2, Y1    1    0    0     X1*Y2 + X2*Y1 = Y                      800
                                (transfer to destination)

Using  two  TDC1003s,  it  is  obvious  that  the  above  complex  multiplication  can  be  performed  in  400  nsec. 
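
The four-step sequence can be traced with a short sketch. The function name is hypothetical; each line corresponds to one table step, with the SUB control forming the product minus the accumulator contents.

```python
def complex_multiply_mac(x1, y1, x2, y2):
    """Multiply (x1 + j*y1) by (x2 + j*y2) with four multiply-accumulate steps."""
    acc = y1 * y2                   # step 1: ACC = 0, load Y1, Y2
    real = x1 * x2 - acc            # step 2: ACC = 1, SUB = 1 gives X1*X2 - Y1*Y2
    acc = x1 * y2                   # step 3: ACC = 0, load X1, Y2
    imag = x2 * y1 + acc            # step 4: ACC = 1, SUB = 0 gives X1*Y2 + X2*Y1
    return real, imag


real_part, imag_part = complex_multiply_mac(1, 2, 3, 4)
```

Steps 1 and 2 are independent of steps 3 and 4, which is why two parts working side by side halve the time.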


FFT  DECIMATION-IN-TIME  KERNEL  SEQUENCE 


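$$F_K = \sum_{n=0}^{N-1} f_n W^{nK}, \qquad K = 0, 1, 2, \ldots$$


Implementation Solution:

$$\begin{aligned}
X_1' &= X_1 + X_2\cos\theta + Y_2\sin\theta = X_1 + Z_1 \\
Y_1' &= Y_1 - X_2\sin\theta + Y_2\cos\theta = Y_1 + Z_2 \\
X_2' &= X_1 - X_2\cos\theta - Y_2\sin\theta = X_1 - Z_1 \\
Y_2' &= Y_1 + X_2\sin\theta - Y_2\cos\theta = Y_1 - Z_2
\end{aligned}$$

FFT points represented by
X1 + jY1, X2 + jY2

Transformed Vectors

X1', X2', Y1', Y2'

$$\theta = 2\pi K/N$$

$$W^K = e^{-j2\pi K/N} = e^{-j\theta} = \cos\theta - j\sin\theta$$

$$Z_1 = X_2\cos\theta + Y_2\sin\theta, \qquad Z_2 = -X_2\sin\theta + Y_2\cos\theta$$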
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FFT  DECIMATION-IN-TIME  KERNEL  SEQUENCE  (CONTINUED) 
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Using one TDC1003J in time sequence operation.


Operation Sequence

Step  Load/Hold  Load              ACC  SUB  RND   Contents of Output Registers           Cumulative Time
      (L/H)                                        at Completion of Operation             at Completion of
                                                   (Note 1)                               Operation (ns) (Note 3)

1     L          X2, cos θ         0    0    0     X2 cos θ                               200
2     L          Y2, sin θ         1    0    0     X2 cos θ + Y2 sin θ = Z1               400
3     L          X1, 1 (Note 4)    1    0    0     X1 + Z1 = X1'                          600
                                                   (transfer to destination)
4     H          X1, 1 (Note 2)    1    1    0     -Z1                                    800
5     H          X1, 1             1    0    0     X1 - Z1 = X2'                          1000
                                                   (transfer to destination)
6     L          X2, sin θ         0    0    0     X2 sin θ                               1200
7     L          Y2, cos θ         1    1    0     -X2 sin θ + Y2 cos θ = Z2              1400
8     L          Y1, 1             1    0    0     Y1 + Z2 = Y1'                          1600
                                                   (transfer to destination)
9     H          Y1, 1             1    1    0     -Z2                                    1800
10    H          Y1, 1             1    0    0     Y1 - Z2 = Y2'                          2000
                                                   (transfer to destination)

Note 1. The outputs are stable within tD time after clocking the output registers, typically 30 ns. Outputs are held stable
until the next step in the sequence is clocked at the end of the step. Consequently output contents are available
to bus to the destination for a tm - tD period, approximately 170 ns for tm = 200 ns.


Note 2. At steps 4 and 9, if the constant loaded is +2 instead of +1, the above sequence can be shortened by two steps
or to 1600 ns. Some loss in accuracy may result using this shortened sequence since the number field would have
to include +2 and a quantity such as cos θ could only be represented with one less significant binary bit.

Note 3. Using two TDC1003J parts instead of one will allow the kernel operation in one-half the time given above.

Note 4. The absence of a true +1 in the fractional 2s complement number field causes a small error to be introduced. An
alternative is to arrange the algorithm so that -1 is used as the operator since this is a valid 2s complement number
having unity absolute value. Another alternative which avoids the error is to use a one bit integer field with
sin θ and cos θ scaled accordingly.
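
The ten-step kernel sequence can be checked against ordinary complex arithmetic with the sketch below. It uses floating point and an exact +1 operand, so the small error discussed in Note 4 does not appear; the function name `butterfly_mac` is hypothetical.

```python
import cmath
import math


def butterfly_mac(x1, y1, x2, y2, theta):
    """Follow the ten-step table; returns ((X1', Y1'), (X2', Y2'))."""
    c, s = math.cos(theta), math.sin(theta)
    acc = x2 * c                    # step 1:  ACC = 0
    acc = y2 * s + acc              # step 2:  Z1
    x1p = x1 * 1 + acc              # step 3:  X1 + Z1 = X1'
    acc = x1 * 1 - x1p              # step 4:  SUB = 1 leaves -Z1
    x2p = x1 * 1 + acc              # step 5:  X1 - Z1 = X2'
    acc = x2 * s                    # step 6:  ACC = 0
    acc = y2 * c - acc              # step 7:  SUB = 1 gives Z2
    y1p = y1 * 1 + acc              # step 8:  Y1 + Z2 = Y1'
    acc = y1 * 1 - y1p              # step 9:  SUB = 1 leaves -Z2
    y2p = y1 * 1 + acc              # step 10: Y1 - Z2 = Y2'
    return (x1p, y1p), (x2p, y2p)


# Reference: a' = a + b*W and b' = a - b*W with W = exp(-j*theta).
theta = math.pi / 4
a, b = complex(1.0, 2.0), complex(3.0, 4.0)
w = cmath.exp(-1j * theta)
upper, lower = butterfly_mac(a.real, a.imag, b.real, b.imag, theta)
```

Steps 4 and 9 reuse the held operands (H in the table) so that the sign of Z1 or Z2 is flipped without reloading the inputs.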


TRW  RESERVES  THE  RIGHT  TO  CHANGE  PRODUCTS  AND  SPECIFICATIONS  WITHOUT  NOTICE.  THIS 
INFORMATION  DOES  NOT  CONVEY  ANY  LICENSE  UNDER  PATENT  RIGHTS  OF  TRW  INC.  OR  OTHERS. 
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APPENDIX  D 
SPAU  2 CIRCUIT  CELLS 

The circuit cell layout for SPAU 2 is shown in Figure 3.4.5 of the main
report text. Most of the central core of cells used for the multiplier are
the same as used for SPAU 1 and displayed in Appendix B. In this appendix
only the unique cells are shown.

Figure 1 shows the circuit schematics used for the input cells, INXP,
INYD, RND, ACC, and SUB. The major change over the SPAU 1 input cells was to
place a buffer amplifier on the input. This placed a fixed delay into the
input to compensate for the clock buffer delay and thereby create a near
zero setup time for the input registers.

Figure 2 shows new clock buffers and intermediate power drivers. These
have much improved speed over those formerly used in SPAU 1. The power OR driver
is unique. The output is push-pull and drives a control line 3/4 around the
chip in 5 ns.

Figure 3 shows the circuit cell for LSB logic 1 injection for 2's complement
arithmetic. This is standard TTL which interfaces with the gate.

Figure 4 shows the output cell schematics. These are modified for an
improved mirror circuit making the logic translation from CML to TTL.

Figure 5 shows the output register along with the accumulation controls
and feedback to the adder. This operates totally using CML. The adder feedback
couples to EFL circuits directly.

The layout of these cells producing the SPAU 2 drawing was accomplished
using Applicon software and computer. The method is basically custom design;
however, a standardized cell size was used. The cell size accommodates the
largest cell. In some places space utilization could be much improved; however,
•.  ii  in  yield  JO  process  places  no  particular  need  to  reduce  chip  size.  Ap- 
utilization  can  be  seen  from  the  chip  photograph  In  the  text, 

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APPENDIX  E 

SPECIFICATION  FOR  SIGNAL  PROCESSING  DELAY  LINE 

INTRODUCTION 

The signal processing delay line, SPDL, 3D current mode logic (CML) LSI
is TTL compatible at inputs and outputs. The SPDL functionally provides the
capability for parallel iterative FFT processor address sequencing in a pipelined
mode. Each SPDL is composed of two shift register sections, shown in
Figure 1. The upper section is used for upper address sequencing and the
lower section is used for lower address sequencing relative to the FFT butterfly
algorithm. The addresses of data points are input as time multiplexed
upper and lower addresses at the A input and bit reversed time multiplexed
upper and lower addresses at the B input. The upper and lower addresses are
steered to their respective shift register sections by appropriately clocking
the corresponding shift register. The upper point addresses are clocked by
ACLF into the upper shift register, designated by subscript u in Figure 1;
and the lower point addresses are clocked by BCLF into the lower shift register,
designated by subscript L in Figure 1. The upper output multiplexer
selects addresses n-4 and n-2, where n is the address being generated and
stored into SPDL from SPAC. The lower output multiplexer selects addresses
n-4 and n.

1.0  FUNCTIONAL 

The SPDL is composed of two shift register sections each containing two
multiplexers and a five-bit shift register. The inputs are time multiplexed
on A, B, or selected independently using the INSEL control from A or B. The
tristate A and B outputs may be tied together for a common address destination
by using complementary OUTAEN and OUTBEN signals or used independently.

The output multiplexer in each section provides selection from a tap or
output from the end of the shift register.
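
The tap selection can be pictured with a behavioral sketch of one shift register section. This is only a model under stated assumptions: the class name `DelaySection` is hypothetical, the most recently clocked address is taken to be address n, and the tap pairs (n-2, n-4 for the upper section; n, n-4 for the lower section) follow the selections described for ALEVSEL and BLEVSEL. The physical stage at which each tap is taken may differ from this indexing.

```python
from collections import deque


class DelaySection:
    """Five-stage address delay line with a two-way output multiplexer."""

    def __init__(self, taps, depth=5):
        self.stages = deque([None] * depth, maxlen=depth)
        self.taps = taps                    # select value -> delay in clock periods

    def clock(self, address):
        """Shift in a new address; stages[k] then holds address n-k."""
        self.stages.appendleft(address)

    def output(self, select):
        """Return the address at the tap chosen by the level-select input."""
        return self.stages[self.taps[select]]


upper = DelaySection({0: 2, 1: 4})          # ALEVSEL: 0 -> (n-2)u, 1 -> (n-4)u
lower = DelaySection({0: 0, 1: 4})          # BLEVSEL: 0 -> nL,     1 -> (n-4)L
for address in (10, 11, 12, 13, 14):
    upper.clock(address)
    lower.clock(address)
```

After the five clocks above, the upper section offers addresses 12 and 10 and the lower section offers 14 and 10.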

1.1 The Input Multiplexer

Input addresses are shifted from TTL up to CML levels at the A and B
inputs. Each input drives two-bit slices of the SPDL. The input requirement
is listed in Table 1. The input multiplexer TTL control line INSEL drives
twelve-bit slices. The input requirement is listed in Table 1.
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TABLE  1 . INPUT  CHARACTERISTICS 


SIGNAL 

PARAMETER 

INA  or 

Low  Level  Input  Voltage 

INB 

High  Level  Input  Voltage 

Low  Level  Input  Current 
VILD=  ‘3  v0^ts 

High  Level  Input  Current 

W 3*4  volts 

Setup  Time 

Data  Stable  to  Clock 

INSEL 

Low  Level  Input  Voltage 

High  Level  Input  Voltage 

Low  Level  Input  Current 

V i ld=  *3  v°lts 

High  Level  Input  Current 
fIHD=  324V 

Setup  Time 

Control  Stable  to  clock 

MNEMONIC 

MIN 

TYP 

MAX 

lint 

VILD 

0.4 

0.8 

Volts 

V I HD 

2.0 

3.4 

Volts 

1 ILD 

1 

cn 

o 

o 

1 

•*>1 

o 

o 

mA 

!ihd 

50 

70 

uA 

tsud 

20 

ns 

V I LC 

0.4 

0.8 

Volts 

VIHC 

2.0 

3.4 

1 ILC 

-7.0 

-9.0 

mA 

1 IHC 

70 

90 

uA 

Tsuc 

30 

ns 

SIGNAL         LOGIC   FUNCTION

INA or INB     0       False Address
               1       True Address

INSEL          0       Select A Input (INA)
               1       Select B Input (INB)
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There  are  twelve  five-bit  CML  shift  registers  in  the  SPDL,  one  shift 
register per bit slice. Each slice has its own dedicated clock driver driving
its shift register. The input specification for the clock load, driving six
clock drivers, is listed in Table 2. The shift register advances on the
negative-going clock edge at the input of the SPDL. Each shift register section
is clocked by an independent input clock. The characteristics of the CML
registers used in the shift register are given in Table 2.

1.3 The Output Multiplexers

Each  of  the  two  CML  shift  register  sections  is  tapped  differently  to 
provide  access  at  different  delays.  These  taps  are  selected  using  the  output 
multiplexers  in  each  section.  The  tap  selections  are  listed  in  Table  3 along 
with  the  multiplexer  to  output  characteristics. 

1.4 Output Tristate Drivers

The output drivers are enabled to a tristate bus by applying a low
input at the OUTEN lines. A high input selects the outputs at a section off.
The characteristics of the output drivers are listed in Table 4.

2.0  CHARACTERISTIC  OPERATING  RATINGS 

The  SPDL  operates  over  the  margins  listed  in  Table  5. 
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TABLE  2.  SHIFT  REGISTER  CHARACTERISTICS 


SIGNAL 


PARAMETER 


MNEMONIC 


MIN 


TYP 


MAX 


UNTT 


ACLF  or 
BCLF 


Low  Level  Input  Voltage 


High  Level  Input  Voltage 


Low  Level  Input  Current 


High  Level  Input  Current 


Propagation  Time  Clock 
Into  Flip-Flop 


Clock  Asynmetry  VTnD  at 
10  MHz 


ILP 


IHP 


1 ILP 


‘IHP 


PDLHP 


'ILSM 


2.0 


0.4 


3.4 


-3.5 


40 


15 


50 


0.8 


-4.5 


50 


25 


60 


mA 


uA 


ns 


CML  D 
F/F 
Q1-»Q5 


Propagation  Time  Q output 
from  clock  (H  or  L) 


Clock  Frequency 


PDQC 


'AC 


10 


10 


16 


ns 


MHz 


MHz 


Application  Clock  Frequency 


6 


SIGNAL      STATE   REGISTER TAP

ALEVSEL     0       (n-2)u
            1       (n-4)u

BLEVSEL     0       nL
            1       (n-4)L

SIGNAL 

PARAMETER 

MNEMONIC 

MIN 

TYP 

MAX 

UNIT 

ALEVSEL 

High  Level  Input  Voltage 

VIH 

2.0 

3.4 

Volts 

BLEVSEL 

Low  Level  Input  Voltage 

VIL 

.4 

.8 

Volts 

High  Level  Input  Current 

!ih 

40 

50 

Low  Level  Input  Current 

!IL 

-3.5 

-4.5 

mA 

Propagation  Time  select 

15 

22 

ns 

output  law 
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TABLE  4.  OUTPUT  DRIVER  CHARACTERISTICS 


SIGNAL 

PARAMETER 

MNEMONIC 

MIN 

TYP 

MAX 

UNIT 

0UT(1-6)A 

or 

OUT ( 1 -6)B 

Output  High  Voltage 

V0HD 

2.4 

3.5 

V 

Output  Low  Voltage 

V0LD 

0.4 

0.6 

V 

Output  High  Current 

V0H=  2-4V 

!0HD 

-200 

-1000 

uA 

Output  High  Current 

V0L=  0.4V 

!old 

6 

mA 

Propagation  Time 
Multiplexer  Input  to 
Driver  Output 
(High  to  low)  ON 

50  pf  800  0 

tpohl 

22 

30 

ns 

Propagation  Time 
Multiplexer  Input  to 
Driver  Output 
(Low  to  high)  ON 

50  pf  800  » 

tpnlh 

30 

45 

ns 



Propagation  Time 

Driver  Input  select 
to  driver  output 
(OFF  to  LOW-  Vcc  -V0U) 

50  pf  800  0 

tponhl 

40 

60 

ns 

SIGNAL 

LOGIC 

FUNCTION 

OUTAEN  or 

0 

Select  output  ON 

OUTBEN 

1 

Select  outpit  OFF 
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INTRODUCTION 


The signal processing address control (SPAC) 3D CML-EFL chip is TTL input
and output compatible. The SPAC generates, on start preset, three address
sequences for time domain parallel iterative FFT kernel processors. These
address sequences are the upper butterfly address sequence, the lower butterfly
address sequence, and the coefficient address sequence. The SPAC or
SPAC's are dynamically programmable to accept a converted log2N code for FFTs
of N points. SPAC's may be cascaded to accommodate extensions, E, of 2^(5(E+1))
points. That is, one SPAC will accommodate 32 points and E=0, two
SPAC's will accommodate 1024 points and E=1 extension, and three SPAC's will
accommodate up to 32,768 points. The SPAC LSI set also generates a last level
signal used to orderly terminate each FFT kernel process.
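
The cascade capacity follows directly from five address bits per SPAC. As a small sketch (the function name `spacs_required` is hypothetical, and N is assumed to be a power of two):

```python
def spacs_required(n_points):
    """Number of cascaded SPACs needed for an N-point FFT, N a power of two."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("N must be a power of two, at least 2")
    levels = n_points.bit_length() - 1      # log2 N
    return -(-levels // 5)                  # ceiling of log2(N) / 5


def max_points(n_spacs):
    """Largest FFT size handled by a cascade: 2**(5*(E+1)) with E = n_spacs - 1."""
    return 1 << (5 * n_spacs)
```

For example, a 64-point transform already needs two SPACs, since 64 exceeds the 32-point capacity of one.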

The coefficient addresses are always generated left justified with the
first thetas (θ) equal to 0 or π independent of the number of points (N)
programmed. The progressive increased resolution with increasing level number
is accomplished using carry injection to succeedingly lower significant
coefficient counter stages by means of a propagating ones shift register.

The upper and lower butterfly addresses are always generated right justified,
selecting incremental points across a spread dependent on the level
number. The resultant upper and lower butterfly addresses are time multiplexed
and transmitted to both real and imaginary FFT memories directly for sequential
FFT operation or indirectly through SPDL sequencer and then to memory for FFT
pipelined operation. The SPAC has detectors for end of spread and level. It
uses combinatorial logic in conjunction with a propagating ones shift register
to flag the proper level-spread operation.
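
For orientation, the three sequences can be illustrated with the textbook in-place radix-2 decimation-in-time ordering. This is a generic sketch and an assumption, not the SPAC's documented counting order; the generator name and the tuple layout are hypothetical. It does show the two properties described above: the spread between upper and lower addresses depends on the level, and the coefficient index starts coarse and gains resolution at each level.

```python
def butterfly_addresses(n_points):
    """Yield (level, upper, lower, coefficient) for an N-point radix-2 DIT FFT.

    The coefficient index k corresponds to theta = 2*pi*k/N.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("N must be a power of two, at least 2")
    levels = n_points.bit_length() - 1
    for level in range(levels):
        spread = 1 << level                         # distance between the two points
        step = n_points // (2 * spread)             # coefficient increment at this level
        for group in range(0, n_points, 2 * spread):
            for k in range(spread):
                upper = group + k
                yield level, upper, upper + spread, k * step


sequence = list(butterfly_addresses(8))
```

An 8-point transform yields three levels of four butterflies each; only the last level uses all four coefficient values.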

The block diagram of the SPAC is presented in Figure 1 and this is
interpreted in 3D logic in the logic schematic, Figure 2.

1.0  FUNCTIONAL 

The SPAC receives a converted log2N program code, a start preset, a free
running clock, an output select line, and carries-in linking other cascaded
SPAC's. The SPAC transmits five bits of coefficient address (left justified
to the MSB), five bits of time multiplexed upper and lower address bits, and
carries-out linking other cascaded SPAC's.
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The converted log2N code defined in Figure 3 is input into the SPAC and
decoded to select an N/2 address sequence for both the upper and lower butterfly
address counters, and log2N levels of execution. The characteristics of the
log2N or PROG 1, 2, and 3 inputs are listed in Table 1.

1.2  Clock  Input 

The clocking of registers is performed using internally and externally
generated controls which enable the free running input clock to count, load,
or shift registers. The clock distribution is shown in the logic diagram,
Figure 2. The input characteristics of the clock are listed in Table 2 and
Figure 4.

1.3 Output Select

The output select provides the means of time multiplexing the upper and
lower butterfly addresses on one set of five lines. The characteristics of the
output select are listed in Table 3 and Figure 5.

1.4 SPAC Carries

Each of the five registers/counters in one given SPAC may be cascaded
into adjoining SPAC's, using the carry input and output lines provided. The
rate of address sequence generation is dependent upon the rippling of carries
from the initial counter stage, through intervening SPAC's, to the last SPAC
and the input to the counter. In general, there is a pair delay added into the
carry cascade for each additional SPAC in the cascade. The characteristics of
each carry are listed in Table 4. In general, all carry outputs use TTL totem
pole drivers capable of sinking the equivalent of low power Schottky TTL (LS-TTL).

1.5 Coefficient Address Outputs

The coefficient address outputs are derived from the coefficient address
counter and output using TTL totem pole drivers, capable of sinking the
equivalent of low power Schottky TTL (LS-TTL). The characteristics of these outputs
are listed in Table 5.

1.6 Upper and Lower Butterfly Address Outputs

The upper and lower FFT butterfly addresses are time multiplexed on the
SPAC address lines using tristate drivers which are capable of sinking the
equivalent of low power Schottky TTL (LS-TTL). The characteristics of the tristate
driver are listed in Table 6.
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TABLE  1.  PROGRAMING  INPUT  CHARACTERISTICS 
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REVISIONS 


LSI ADDRESS CONTROL PROGRAMMING CODE


THE PROGRAM CODE ENTERS THE FFT POINT SIZE (N) AS N/2 IN THE FOLLOWING FORMAT.

0  ALL BITS IN THE 5-BIT SLICE ARE ZERO, INCLUDING THE MOST SIGNIFICANT BIT OF THE
   LOWER ORDER SLICE.

1  THE MSB OF THE NEXT LOWER ORDER SLICE IS A ONE.

2  1ST BIT OF THE 5-BIT SLICE

3  2ND BIT OF THE 5-BIT SLICE

4  3RD BIT OF THE 5-BIT SLICE

5  4TH BIT OF THE 5-BIT SLICE

6  5TH BIT OF THE 5-BIT SLICE


N/2 IS A 10 BIT WORD (BITS 10 THROUGH 1), BROKEN UP INTO 2 6-BIT OVERLAPPING
FIELDS, ONE FOR EACH LSI. THESE 2 FIELDS ARE RECODED IN THE LSI INTO 6-BIT FIELDS.
EACH PROGRAM CODE (P3 P2 P1) GIVES THE BIT POSITION ON, IN OCTAL.


EXAMPLE: 1024 PT FFT, N/2 = 512 = 2^9 = 10000/00000

PROGRAM CODE BECOMES: 6 (octal), 0 (octal) for 2 LSI's
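
One reading of the programming format above can be sketched as follows. The interpretation is an assumption wherever the drawing is ambiguous: the "1st bit" of a slice is taken to be its least significant bit, the codes are returned most significant SPAC first, and the function name `program_codes` is hypothetical. The sketch does reproduce the worked example (1024 points, two LSIs, codes 6 and 0).

```python
def program_codes(n_points, n_spacs):
    """Return one octal program code per SPAC, most significant SPAC first."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("N must be a power of two, at least 2")
    half = n_points // 2                        # the code enters N as N/2
    if half >= 1 << (5 * n_spacs):
        raise ValueError("N too large for this many SPACs")
    codes = []
    for s in range(n_spacs - 1, -1, -1):
        field = (half >> (5 * s)) & 0x1F        # this SPAC's 5-bit slice
        if field:
            codes.append(field.bit_length() + 1)    # 1st bit -> 2 ... 5th bit -> 6
        elif s > 0 and (half >> (5 * s - 1)) & 1:
            codes.append(1)                     # MSB of the next lower slice is a one
        else:
            codes.append(0)                     # nothing set in or just below the slice
    return codes
```

The overlap between adjacent slices is what code 1 expresses: a SPAC whose own slice is empty still needs to know that the slice just below it is full.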


TABLE  2.  INPUT  CLOCK  CHARACTERISTICS 


\ SIGNAL 

PARAMETER 

MNEMONIC 

MIN 

TYP 

MAX 

KEM 

CLOCK  F 

Input  H1;h  Voltage 

V I HC 

2.0 

3.4 

V 

Input  Low  Voltage 

VILC 

0.4 

0.8 

V 

Input  High  Current 

1 IHC 

.2 

.4 

mA 

Input  Low  Current 

1 ILC 

-3.0 

-3.8 

mA 

Clock  Assymmetry 

Ac 

50 

60 

% 

Clock  Frequency 
(Three  cascaded  SPAC's) 

fc 

5 

< 5 

MHz 

Propagation  Time 

Input  to  Register 

tpdc 

25 

40 

ns 

SIGNAL 

LOGIC 

EDGE 

FUNCTION 

CLOCK  F 

0 

Falling 

Clock  Upper  and  Lower, 
Coefficient  Counters 

1 

Rising 

Clock  Level  Shift  Register, 
Coefficient  Shift  Register 

TABLE  3.  OUTPUT  SELECT  CHARACTERISTICS 


SIGNAL 

PARAMETER 

MNEMONIC 

MIN 

TYP 

MAX 

UNIT 

ADRSEL  MCT6 

Input  High  Voltage 

VIHS 

2.0 

3.4 

Bjl 

PRESET 

Input  Low  Voltage 

VILS 

0.4 

0.8 

Input  High  Current 

!ihs 

0.3 

mA 

Input  Low  Current 

1 ILS 

-3.C 

K 

mA 

Propagation  Time, 
(Lower  Counter  to 
Output) 

(50  pf  IK  load) 

tpdlcs 

25 

40 

nsec 

Propagation  Time, 
(Upper  Counter  to 
Output) 

(50  pf  25K  load) 

tpducs 

35 

50 

nsec 

SIGNAL 

LOGIC 

FUNCTION 

A0RSELMCT5 

0 

Select  Lower  Address  Counter 
to  Output  Address 

1 

Select  Upper  Address  Counter 
to  Output  Address 

TABLE 4.  CARRY CHARACTERISTICS

SIGNAL                      PARAMETER                    MNEMONIC   MIN / TYP / MAX   UNIT
CICOEF, SHINT, LEVINF,      Input High Voltage           VIHR       2.0    3.4        V
CSPRINF, CIADDF, CILOF,     Input Low Voltage            VILR       0.4    0.8        V
CIUPF, SPREADF, CSHINT,     Input High Current           IIHR       50*    100*       uA
LEVELF                      Input Low Current            IILR       -.5*   -.75*      mA
                            Propagation Time,            TPDRHL     7      10         ns
                              Input Inverter
                            Propagation Time,            TPDRLH     15     20         ns
                              Input Inverter

CSHOUT, COCOEF, SHOUT,      Output High Voltage          VOHR       2.4    3.5        V
CLEVOF*, CSPROF, COADDF,    Output Low Voltage           VOLR       0.4    0.6        V
COLOF, COUPF                Output High Current          IOHR       -1                mA
                              (VOHR = 2.4V)
                            Output Low Current           IOLR       4*                mA
                            Propagation Time,            TPDRO      15     20         nsec
                              Output Driver
                              (20 pf, 4K)

*COLOF, CLEVOF, & CSPROF require IOLR = 6 mA


SIGNAL    LOGIC     FUNCTION
CSHINT    1         Identity of Most Significant SPAC for 6-bit Coefficient
                    Address
CSHINT    CSHOUT    Carry In from Carry Out of Coefficient Shift Register
CICOEF    COCOEF    Carry Into Coefficient Address Counter from Carry Out of
                    lower order SPAC Coefficient Address Counter

*SPREADF & LEVELF INPUT LOADS ARE DOUBLED



TABLE 4 (cont'd)

SIGNAL     LOGIC      FUNCTION
SHINT      SHOUT      Carry Into Level Shift Register from Carry Out of
                      higher order SPAC level shift register.
CLEVINF    CLEVOF     Carry Into Level Bus from Carry Out of lower order
                      SPAC level detector.
CSPRINF    CSPROTF    Carry Into Spread Bus from Carry Out of lower order
                      SPAC Spread Detector.
LEVELF     CLEVOF     All SPAC's receive as inputs the most significant
                      SPAC level output.
SPREADF    CSPROF     All SPAC's receive as inputs the most significant
                      SPAC spread output.
CIADDF     COADDF     Carry Into Lower Counter Adder from lower order SPAC
                      lower counter adder.
CILOF      COLOF      Carry Into Lower Address Counter from lower order
                      SPAC Lower Address Counter.
CIUPF      COUPF      Carry Into Upper Address Counter from lower order
                      SPAC Upper Address Counter.
CSHOUT     1          Coefficient shift register out from higher order SPAC.
COCOEF     0          Cumulative propagating carry from lower order
                      coefficient.
CLEVOTF    0          Cumulative propagating carry from level detection
                      cascade in lower order SPAC's.
SHOUT      0          Level shift register output from higher order SPAC.
CSPROTF    0          Cumulative propagating carry from spread detection
                      cascade in lower order SPAC's.
COADDF     0          Cumulative propagating carry from lower order SPAC
                      lower counter adder.
COLOF      0          Cumulative propagating carry from lower order SPAC
                      lower address counter.
COUPF      0          Cumulative propagating carry from lower order SPAC
                      upper address counter.
TABLE 5.  COEFFICIENT ADDRESS OUTPUT CHARACTERISTICS

SIGNAL               PARAMETER                 MNEMONIC   MIN / TYP / MAX   UNIT
COEF01T to COEF05T   Output High Voltage       VOHF       2.4    3.5        V
                     Output Low Voltage        VOLF       0.4    0.6        V
                     Output High Current       IOHF       -1.0              mA
                     Output Low Current        IOLF       6.0               mA
                     Propagation Time,         tpdf       45     60         ns
                       Register to Output
                       (50 pf, 2.5K Ω)

SIGNAL    LOGIC   FUNCTION
COEF01    1       Most significant bit of θ angle for SPAC coefficient
                  address slice.
COEF05    1       Least significant bit of θ angle for SPAC coefficient
                  address slice.


2.0  MAXIMUM  RATINGS 

The maximum operating ratings for the SPAC 3D LSI are given in Table 7.

3.0  OPERATION 

The addresses generated by the SPAC for a parallel iterative 512 point
FFT are included in Table 8, as derived from the flow chart, Figure 6. These
addresses are used to access real and imaginary data points and/or intermediate
data and sine and cosine FFT coefficients as shown in the representative
FFT, Figure 7.

The maximum control and data propagation times are shown in Table 9.
These propagation times are also shown in the SPAC critical timing diagram
(Fig. 8), which applies to two SPAC's connected together. The interconnect
diagram for expanding the SPAC from 5-bit to 10-bit address generation is
shown in Figure 9.
TABLE 6.  TRISTATE ADDRESS DRIVER CHARACTERISTICS


TABLE 7.  MAXIMUM OPERATING RATINGS


*Preliminary Estimates
SPREAD     UPPER        LOWER         LEVEL COUNTER   COEFFICIENT
COUNTER    ADDRESS      ADDRESS       (LEVEL)         ADDRESS

           0 to 31      32 to 63
           64           96
                        224 to 255    19 (5)
                                      21 (5)          160 (5π/8)
                                      22 (5)          192 (3π/4)
                                      23 (5)          224 (7π/8)
                                      24 (5)          16 (π/16)
                                      25 (5)          48 (3π/16)
                                      26 (5)          80 (5π/16)
                                      27 (5)          126 (7π/16)
                                      28 (5)          144 (9π/16)
                        864 to 895    29 (5)          176 (11π/16)
                                      30 (5)
                                      31 (5)          240

SPREAD     UPPER        LOWER         LEVEL COUNTER   COEFFICIENT
COUNTER    ADDRESS      ADDRESS       (LEVEL)         ADDRESS

8 to 15    512 to 519   520 to 527    96 (7)          4 (π/64)
8 to 15    528 to 535   536 to 543    97 (7)          12 (3π/64)
8 to 15    544 to 551   552 to 559    98 (7)          20 (5π/64)
8 to 15    560 to 567   568 to 575    99 (7)          28 (7π/64)
8 to 15    576 to 583   584 to 591    100 (7)         36 (9π/64)
8 to 15    592 to 599   600 to 607    101 (7)         44 (11π/64)
8 to 15    608 to 615   616 to 623    102 (7)         52 (13π/64)
8 to 15    624 to 631   632 to 639    103 (7)         60 (15π/64)
8 to 15    640 to 647   648 to 655    104 (7)         68 (17π/64)
8 to 15    656 to 663   664 to 671    105 (7)         76 (19π/64)
8 to 15    672 to 679   680 to 687    106 (7)         84 (21π/64)
8 to 15    688 to 695   696 to 703    107 (7)         92 (23π/64)
8 to 15    704 to 711   712 to 719    108 (7)         100 (25π/64)
8 to 15                               109 (7)         108 (27π/64)
8 to 15                               110 (7)         116 (29π/64)
8 to 15                               111 (7)         124 (31π/64)
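The legible level-7 entries of the address table (for example level counter 97 with upper addresses 528 to 535, lower addresses 536 to 543, and coefficient address 12) follow a regular pattern. The Python sketch below reproduces that pattern only as an illustration. The function name `level7_row` and the closed-form rule are assumptions inferred from the printed values; they are not taken from the SPAC logic or from the flow chart of Figure 6.

```python
def level7_row(level_counter):
    """Return the address ranges and coefficient address for one level-7 block.

    Assumed rule, inferred from the legible table entries only:
      - each block spans 16 addresses, 8 "upper" followed by 8 "lower";
      - level counter 96 corresponds to upper address 512;
      - the coefficient address is 4 times an odd number 1, 3, 5, ... 31.
    """
    if not 96 <= level_counter <= 111:
        raise ValueError("sketch only covers level counter values 96..111")

    block = level_counter - 64            # block index within level 7 (assumption)
    upper_start = 16 * block              # first upper address of the block
    lower_start = upper_start + 8         # lower addresses follow the upper ones
    odd = 2 * (level_counter - 96) + 1    # angle numerator: angle = odd * pi / 64

    return {
        "upper": (upper_start, upper_start + 7),
        "lower": (lower_start, lower_start + 7),
        "coefficient": 4 * odd,
        "angle_numerator": odd,
    }


if __name__ == "__main__":
    for counter in (97, 108, 111):
        print(counter, level7_row(counter))
```

For level counter 108 the sketch gives upper addresses 704 to 711, lower addresses 712 to 719, and coefficient address 100, which agrees with the printed entries.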
TABLE 9.  SPAC DELAYS


CLOCK DRIVER (CLOCKF -> CLOCKT)                            10 -> 30 ns
   for LOWER ADDRESS COUNTER

D FLIP FLOP PROPAGATION TIME (CLOCKT -> FF OUT)            10 -> 30 ns

LEVEL & SPREAD SENSE GENERATE (FF OUT -> LEVOUTF)          50 -> 100 ns
   includes
      FF Carry In Delay        25 ns
      LOn generate             30 ns
      Level generate           15 ns
      Level Output Driver      30 ns
                              100 ns

LEVEL & SPREAD PROPAGATE (LEVINF -> LEVOUTF)               30 -> 60 ns

LOWER ADDRESS COUNTER CARRY PROPAGATE                      30 -> 60 ns
   (CILOF -> COLOF)

UPPER ADDRESS, COEFFICIENT ADDRESS COUNTER                 20 -> 100 ns
   CARRY PROPAGATE (CIUPF -> COUPF) or (CICOEF -> COCOEF)

ADDRESS OUTPUT DRIVERS (FF OUT -> ADDRb)                   10 -> 60 ns

2n Adder Delay (CLOCKF -> SHIFT REG -> COADD)              30 -> 100 ns
   1 stage delay only through adder

COEF SEL to COEF ADDRESS OUTPUTS                           10 -> 60 ns
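As a simple consistency check on Table 9, the four contributions listed under the level and spread sense generate path can be summed and compared with the clock period implied by the 5 MHz clock frequency of Table 2. The Python sketch below does only that arithmetic. The dictionary and function names are illustrative assumptions, and the comparison against one clock period is an assumed figure of merit, not a timing rule stated in the report.

```python
# Component delays of the level and spread sense generate path (Table 9), in ns.
SENSE_GENERATE_NS = {
    "FF carry in delay": 25,
    "LOn generate": 30,
    "Level generate": 15,
    "Level output driver": 30,
}


def total_ns(parts):
    """Sum the component delays of one path, in nanoseconds."""
    return sum(parts.values())


def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


if __name__ == "__main__":
    path = total_ns(SENSE_GENERATE_NS)
    period = clock_period_ns(5.0)  # 5 MHz, three cascaded SPAC's (Table 2)
    print(f"sense generate path: {path} ns, clock period: {period} ns")
```

The components add up to the 100 ns total printed in the table, which is half of the 200 ns period of a 5 MHz clock.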
APPENDIX G


LOW POWER T²L LSI
CONFIGURABLE GATE ARRAY

(This  LSI  design  approach  is  used  in  SPAC  2 chip.) 

The CGA is a universal logic array developed by the Microelectronics Center
of TRW. The array consists of 158 TTL NAND elements with a variety of
configurations available for each gate, independently. Unused elements in the array
consume no power. A mask programmed signal matrix provides a flexible method
for gate interconnection. A maximum of 38 pins is available for either input
or output signals.

Gate  Configurations 

Each  gate  utilized  is  configured  by  selecting  a function,  an  output  type,  and  a 
power  level  from  columns  (A),  (B),  and  (C)  below  (certain  output  features  may  be 
combined  in  a single  gate). 


(A) GATE              (B) OUTPUT                (C) POWER
    CONFIGURATIONS        TYPE                      OPTION

Inverter              Active Pull-up            Standard: Fan-out = 10 (max)
2-Input NAND          Passive Pull-up           Driver:   Fan-out = 50 (max)
3-Input NAND          Diode Excursion Clamp               or any output
Expander              Isolation Diode


CGA2 Advantages

• Direct low power T²L compatibility - for new logic designs or as a
replacement

• Fast Turnaround - Wafers are stockpiled up to the programmed mask level.
Delivery - 60 days after receipt of logic diagram.
Operating characteristics (OVER FULL OPERATING TEMPERATURE RANGE)

PARAMETER                          TEST CONDITIONS             MIN / TYPICAL / MAX   UNITS

Operating Case Temperature                                     -55    +100           °C
Supply Voltage, VCC                                            4.5    5.0    5.5     V
High-level Output Voltage, VOH     VCC = 4.5V, IOH = max       2.4    3.4            V
Low-level Output Voltage, VOL      VCC = 4.5V, IOL = max       0.35   0.5            V
High-level Input Voltage, VIH                                  2.0                   V
Low-level Input Voltage, VIL                                   0.8                   V

High-level Input Current, IIH:
   Standard Gate                   VIH = 2.0V                  40                    uA
   Driver Gate                     VIH = 7.0V                  120                   uA

Low-level Input Current, IIL:
   Standard Gate                   VIL = 0.8V                  -120                  uA
   Driver Gate                     VIL = 0.8V                  -360                  uA

High-level Output Current, IOH:
   Standard Gate                   VCC = 4.5V, VOH = 2.4V      -400                  uA
   Driver Gate                     VCC = 4.5V, VOH = 2.4V      -2.2                  mA

Low-level Output Current, IOL:
   Standard Gate                   VOL = 0.5V                  1.0                   mA
   Driver Gate                                                 6.0                   mA

Propagation time:
   Standard Gate, tPLH             RL = 3.4K, CL = 40 pf       45     55             ns
   Standard Gate, tPHL             (See Figure 1)              50     65             ns
   Driver Gate, tPLH               RL = 680 Ω, CL = 200 pf     35     45             ns
   Driver Gate, tPHL               (See Figure 1)              45     60             ns

Power Consumption:
   Standard Gate ('0')                                         0.5    0.65           mW
   Standard Gate ('1')                                         1.0    1.3            mW
   Driver Gate ('0')                                           1.3    1.75           mW
   Driver Gate ('1')                                           7.1    9.5            mW

Chip size: 105 x 205 mils

Package: 40-pin DIP or 40-pin FLAT PACK
PROGRAMMING THE CGA-2

To  understand  the  CGA  approach  to  logic  design  each  part  of  the  chip 
should  be  discussed.  We  will  begin  with  the  gate  elements. 

LOGIC  GATES 

On the CGA-2 worksheet (Figure 1) are 120 long rectangles labeled G101
thru G430 and 38 rectangles around the edge labeled P1 thru P38. This worksheet
is actually a computer plot of some of the levels of the CGA-2 chip and
the rectangles denote the locations of internal gates and pad gates, respectively.

Figures 2 and 3 show the schematic of these gates. The "X" indicates
discretionary contacts that determine the function of the gate. (A straight line
through an "X" is a continuous conductor. Intersecting lines are connected with
a programmed contact.) The internal and pad gate circuits are identical and,
though the contact arrangement varies slightly, all circuit functions can be
implemented with either gate type. Possible circuit functions are:

o standard gate, open collector, 3-input NAND
o standard gate, active pull-up, 3-input NAND
o three-input gate expander
o 5K-ohm pull-up resistor
o 25K-ohm pull-up resistor
o isolation diode for use with pull-up resistors
o driver gate, open collector, 3-input NAND
o driver gate, active pull-up, 3-input NAND
o three-diode excursion clamp

A few restrictions are placed on combining these within one gate location:

o A gate location may contain only one gate type (standard, driver, or
expander).
o Excursion clamps are not allowed with active pull-up gates.
o An isolation diode can be used only with a pull-up resistor.
o Only one pull-up resistor (5K or 25K) can be used.

These restrictions place no serious limitation on the flexibility of the
configurable gate.
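The combination rules listed above can be expressed as a small checker. The Python sketch below is illustrative only: the `GateConfig` record, its field names, and the `check_restrictions` function are assumptions made for this example and do not correspond to any tool described in this appendix. Holding the gate type and the pull-up resistor in single-valued fields enforces the "only one gate type" and "only one pull-up resistor" rules by construction.

```python
from dataclasses import dataclass
from typing import List, Optional

GATE_TYPES = {"standard", "driver", "expander"}
PULLUP_VALUES = {None, "5K", "25K"}


@dataclass
class GateConfig:
    """Options chosen for one gate location (hypothetical record layout)."""
    gate_type: str                          # exactly one of GATE_TYPES
    active_pullup: bool = False
    excursion_clamp: bool = False
    pullup_resistor: Optional[str] = None   # "5K", "25K", or None
    isolation_diode: bool = False


def check_restrictions(cfg: GateConfig) -> List[str]:
    """Return a list of violated restrictions; an empty list means the choice is allowed."""
    problems = []
    if cfg.gate_type not in GATE_TYPES:
        problems.append("gate type must be standard, driver, or expander")
    if cfg.pullup_resistor not in PULLUP_VALUES:
        problems.append("pull-up resistor must be 5K or 25K")
    if cfg.excursion_clamp and cfg.active_pullup:
        problems.append("excursion clamp not allowed with active pull-up")
    if cfg.isolation_diode and cfg.pullup_resistor is None:
        problems.append("isolation diode requires a pull-up resistor")
    return problems


if __name__ == "__main__":
    ok = GateConfig("driver", excursion_clamp=True,
                    pullup_resistor="25K", isolation_diode=True)
    bad = GateConfig("standard", active_pullup=True, excursion_clamp=True)
    print(check_restrictions(ok))    # no violations expected
    print(check_restrictions(bad))   # one violation expected
```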

The configuration options can be specified using a configuration code
consisting of a letter and four numbers. The letter, P or I, indicates whether
the contact pattern defined applies to a Pad or Internal gate. The four digits
are an octal code describing the combination of options chosen. These are
determined from the CGA-2 Gate map (Figure 4) as shown. The first example
is a driver gate with active pull-up and 3 inputs. The second is a driver
gate expanded from the right with 3 inputs, passive pull-up, and 25K pull-up
resistor with isolation diode and excursion clamp.
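The format of the configuration code (one letter, P or I, followed by four octal digits) can be validated mechanically. The Python sketch below checks only that format and splits the code into its parts. The function name `parse_config_code` and the sample code `I0037` are assumptions for illustration; the meaning of each octal digit comes from the gate map of Figure 4, which is not reproduced here, so the sketch does not attempt to decode the options.

```python
import re

# One letter (P = pad gate, I = internal gate) followed by exactly four octal digits.
_CODE_PATTERN = re.compile(r"^([PI])([0-7]{4})$")


def parse_config_code(code):
    """Split a configuration code into (gate kind, list of octal digits, integer value).

    Raises ValueError if the code is not a P or I followed by four octal digits.
    """
    match = _CODE_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a valid configuration code: {code!r}")
    kind = "pad" if match.group(1) == "P" else "internal"
    digits = [int(d, 8) for d in match.group(2)]
    value = int(match.group(2), 8)   # the four digits read as one octal number
    return kind, digits, value


if __name__ == "__main__":
    print(parse_config_code("I0037"))   # hypothetical sample code
```

Digits 8 and 9 are rejected because the code is octal, which catches one common transcription error before a cell is looked up in the library.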

Once the contacts are selected on the schematic, they are readily transferred
to the layout diagram (Figs. 5 and 6), which bears a 1:1 component and
contact correspondence with the schematic, but with the contacts arranged
physically as they are on the actual gate layout (Figs. 7 and 8). After checking,
the layout is taken to the Applicon Graphics System where a cell is created
containing only the contacts required. The corresponding logic symbol is drawn
in the cell and a label is attached. Figure 9 shows the contents of the cell.
The contacts are emphasized with a cross for clarity in plots, but only the
square contacts themselves will be employed in mask generation. A final plot
is run with the cell superimposed on a gate cell exactly as it will be placed
in chip programming (Fig. 10). The plot is checked and filed for reference,
and the cell is stored in a tape library of such cells.

When a configuration is first used, it is again checked for correctness
and the circuit operation is verified when fabrication is completed. The
configuration is then listed as verified with any pertinent performance
specifications and can be reused in future designs with great confidence.

INTERCONNECT  MATRIX 

Again, looking at the worksheet (Fig. 1 is a 50% reduction of an actual
worksheet), the logic gate locations just discussed are recognized and a matrix
of interconnect paths is evident. It consists of diffused tunnels crossing
longer metal runs. Signals may enter or leave internal gates by the top and/or
bottom. All inputs are on narrow tunnels, to reduce parasitic capacitance, while
outputs are through wide tunnels to reduce parasitic resistance. The inputs to
the gates are labeled A, B, and C starting with the input tunnel nearest the
output tunnel for that gate. Connection between a tunnel and a metal line is
made by specifying a contact hole at their point of intersection. Connection
may be made from a bonding pad to the output of the gate at that pad site, the
"A" input of that gate, or into the matrix.
Though the layout is principally a matter of intuition and trial and
error, a few general guidelines speed the process.

• Locate Gates Driving Off-Chip First

In an effort to maintain high off-chip drive capability, it is
recommended that a gate driving off-chip be located at the site of
the pad it is driving to reduce parasitic resistance between the
gate and the pad.

• Locate Gates Driven From Off-Chip Next

In a similar effort to reduce parasitic capacitance on inputs, try
to locate an input gate next to its pad.

• Layout For Minimum Parasitics

Keep in mind that metal is much preferable to tunnel. In most cases,
IR drop appears to be a limiting factor rather than capacitance.

Table 1 gives some representative values of parasitics for the various
interconnect paths.

An example of the layout process will clarify some of the finer points in the
design. In the text section 5.2, Figure 5.2.2 shows a logic diagram and
Figure 5.2.6 shows the completed layout. The gate numbers and pads have
one to one correspondence for the two drawings.
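To illustrate how the per-line-width figures of Table 1 scale with tunnel length, and why IR drop tends to matter more than capacitance, the Python sketch below multiplies the slim and wide tunnel values by a number of line widths and estimates the resulting voltage drop and RC product. The function names are illustrative assumptions, the lumped RC product is a crude approximation rather than a method stated in the report, and applying the 6 mA driver sink current to a tunnel run is an assumed operating condition.

```python
# Tunnel parasitics per line width from Table 1: (capacitance in pF, resistance in ohms).
SLIM_TUNNEL = (0.05, 8.0)
WIDE_TUNNEL = (0.13, 2.5)


def tunnel_parasitics(line_widths, per_width):
    """Scale the per-line-width values to a tunnel run of the given length."""
    cap_pf, res_ohm = per_width
    return line_widths * cap_pf, line_widths * res_ohm


def ir_drop_volts(res_ohm, current_ma):
    """Voltage drop across a resistance carrying the given current in mA."""
    return res_ohm * current_ma / 1000.0


def rc_time_ns(cap_pf, res_ohm):
    """Lumped RC product in ns (1 pF x 1 ohm = 1e-3 ns)."""
    return cap_pf * res_ohm * 1e-3


if __name__ == "__main__":
    cap, res = tunnel_parasitics(6, SLIM_TUNNEL)
    print(f"6 slim line widths: {cap:.2f} pF, {res:.1f} ohms")
    # Slim bottom-of-gate tunnel from Table 1: 1.10 pF, 175 ohms.
    print(f"IR drop at 6 mA: {ir_drop_volts(175.0, 6.0):.2f} V")
    print(f"RC product: {rc_time_ns(1.10, 175.0):.3f} ns")
```

Six slim line widths give 0.30 pF and 48 ohms, matching the Table 1 entry for the short slim tunnel at the top of a gate. For the 175 ohm slim tunnel, a 6 mA current would drop about 1 V, a large fraction of the TTL logic swing, while the RC product stays well under 1 ns.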


Fig. 9  Internal Gate Configuration I0037

CALCULATED INTERCONNECT PARASITICS


METAL                                  CAP (pF)    RES (Ω)

1 cm                                   4.0         29.0
1 line width                           0.00056     0.00406
1 gate width                           0.0056      0.0406
SHORT BUS TOP OF GATE (3 Gates)        0.017       0.122
LONG BUS TOP OF GATE (6 Gates)         0.034       0.244
SHORT BUS BOTTOM OF GATE (14 Gates)    0.784       5.684
LONG BUS BOTTOM OF GATE (16 Gates)     0.896       6.496
PAD BUS TOP                            1.064       7.714
PAD BUS TOP OF SIDE                    0.560       4.060
PAD BUS CENTER OF SIDE                 0.448       3.248
PAD BUS BOTTOM OF SIDE                 0.392       2.842
JUMPER                                 0.0168      0.122
INTERNAL GATE LENGTH                   0.252       1.830


TUNNEL                                 CAP (pF)    RES (Ω)

1 line width (slim)                    0.05        8.0
1 line width (wide)                    0.13        2.5
TOP OF GATE (SHORT)(SLIM)              0.30        48.0
TOP OF GATE (SHORT)(WIDE)              0.79        15.0
TOP OF GATE (LONG)(SLIM)               0.40        64.0
TOP OF GATE (LONG)(WIDE)               1.05        20.0
BOTTOM OF GATE (SLIM)                  1.10        175.0
BOTTOM OF GATE (WIDE)                  2.88        55.0
TOP PAD GATE (SLIM)                    0.55        88.0
TOP PAD GATE (WIDE)                    1.44        27.5
SIDE PAD GATE (SLIM)                   0.90        144.0
SIDE PAD GATE (WIDE)                   2.36        45.0
BOTTOM PAD GATE (SLIM)                 0.60        96.0
BOTTOM PAD GATE (WIDE)                 1.57        30.0
UNDER POWER BUS (SLIM)                 1.15        184.0
UNDER POWER BUS (WIDE)                 3.00        57.5
JUMPER (2 METAL WIDTHS)                0.10        16.0
JUMPER (3 METAL WIDTHS)                0.15        24.0

TABLE 1



U.S. Government Printing Office: 197* - 757-090/721


